
Commit 37bf9bc

Address review comments
1 parent 59c3629 commit 37bf9bc

6 files changed: 92 additions & 91 deletions

CodingConventions.md

Lines changed: 1 addition & 1 deletion
@@ -594,4 +594,4 @@ Coding Conventions for writing Tensor Comprehensions
 
 Please see the following documentation
 [entry](https://facebookresearch.github.io/TensorComprehensions/coding_conventions.html)
-on how to write Tensor Comprehensions in a standard, legible, fashion.
+on how to write Tensor Comprehensions in a standard legible fashion.

docs/doxygen/index.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ Let's start with a simple example is a matrix vector product:
 `A` and `x` are input tensors. `o` is an output tensor.
 The statement `o(r) +=! A(r,r_c) * x(r_c)` introduces two index variables `r` and `r_c`.
 Their range is inferred by their use indexing `A` and `x`. `r = [0,R)`, `r_c = [0,C)`.
-Because `r_c` only appears on the right side,
+Because `r_c` only appears on the righthand side,
 stores into `o` will reduce over `r_c` with the reduction specified for the loop.
 Reductions can occur across multiple variables, but they all share the same kind of associative reduction (e.g. +=)
 to maintain invariant (3). `mv` computes the same thing as this C++ loop:
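The loop itself lies outside this hunk; as a point of reference, a minimal C++ sketch of the loop nest implied by the semantics described above (names and container types are illustrative, not taken from the documentation):

#include <vector>

// mv: o(r) +=! A(r,r_c) * x(r_c), with inferred ranges r in [0,R) and r_c in [0,C)
void mv(const std::vector<std::vector<float>>& A,  // R x C input
        const std::vector<float>& x,               // C-element input
        std::vector<float>& o) {                   // R-element output, preallocated
    const size_t R = A.size();
    const size_t C = x.size();
    for (size_t r = 0; r < R; ++r) {
        o[r] = 0.0f;                               // `+=!` zero-initializes the reduction
        for (size_t r_c = 0; r_c < C; ++r_c) {
            o[r] += A[r][r_c] * x[r_c];            // reduce over r_c with +
        }
    }
}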

docs/source/coding_conventions.rst

Lines changed: 18 additions & 17 deletions
@@ -3,7 +3,7 @@ Coding Conventions
 
 In order to increase readability across Tensor Comprehensions written by
 multiple authors and to reduce the amount of surprising behavior, the
-following conventions should be adopted when writing TC. Generally in TC one
+following conventions should be adopted when writing TC. Generally in TC, one
 should increment nesting by 4 whitespaces at each level and align tensor names
 and indices where appropriate to make memory access patterns emerge. Since
 these two goals can easily be conflicting, use your best judgement to tradeoff
@@ -12,7 +12,7 @@ between the two goals. Such examples are provided below.
 Use indices named after parameters
 ----------------------------------
 
-Use upper-case names for parameters and input/output tensors.
+Use upper-case names for parameters and capital-case names for input/output tensors.
 Use lower-case names for indices to match the name of the parameter
 corresponding to the dimension upon which they iterate.
 In other words, prefer:
@@ -55,46 +55,47 @@ to:
 C(m, n) +=! A(m, k) * B(k, n)
 }
 
-Filter non-rectangular regions with deta-dependencies
+Filter non-rectangular regions with data-dependencies
 -----------------------------------------------------
 
-TC semantics only support (hyper-)rectangular iteration spaces. This is a hard
-requirement to make range inference non-ambiguous. To simulate non-rectangular
-iteration spaces, one can use the following:
+TC semantics are restricted to (hyper-)rectangular iteration spaces.
+This is a hard requirement to ensure range inference is non-ambiguous (see inference_).
+To simulate non-rectangular iteration spaces, one can use the following:
 
 .. code::
 
 def matmul(float(M, K) L, float(K, M) U) -> (LU) {
 LU(m1, m2) +=! (r_k >= m1 and r_k =< m2) ? L(m1, r_k) * U(r_k, m2) : 0
 }
 
-However, the following is incompatible with range inference and will fail
-the semantic checks in the TC compiler:
+However, non-(hyper)-rectangular iteration spaces (e.g. triangular) are
+incompatible with range inference and will fail the semantic checks in the TC
+compiler:
 
 .. code::
 
 def matmul(float(M, K) L, float(K, M) U) -> (LU) {
 LU(m1, m2) +=! L(m1, r_k) * U(r_k, m2) where r_k in m1:M, r_k in 0:m2+1
 }
 
-The reader may remark that this is an inefficient way of performing
+The reader may remark that this is an inefficient way of writing
 matrix-multiplication of triangular matrices.
-Lowering such operations efficient from TC is the subject of future work.
+Lowering such operations efficiently from TC is the subject of future work.
 
-Prefix gradient tensors names with :code:`g_`
+Prefix gradient tensors names with :code:`d_`
 ---------------------------------------------
 
 When implementing backward operations, pass the inputs to the backwards pass
-in the same order as the outputs to the forward passs and use the same tensor
-name prefixed by :code:`g_`. For instance:
+in the same order as the outputs of the forward pass and use the same tensor
+name prefixed by :code:`d_`. For instance:
 
 .. code::
 
 def conv(float(N,C,H,W) I, float(M,C,KH,KW) Wt) -> (O) {
 ...
 }
-def conv_bw(float(N,C,H,W) I, float(M,C,KH,KW) Wt, float(N,M,HO,WO) g_O) -> (g_I) {
+def conv_bw(float(N,C,H,W) I, float(M,C,KH,KW) Wt, float(N,M,HO,WO) d_O) -> (d_I) {
 ...
 }
 
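To make the rectangular-iteration requirement concrete, here is a minimal C++ sketch (illustrative, not part of this commit) of what the guarded `matmul` in the hunk above computes: the reduction still runs over the full rectangular range of `r_k`, and the data-dependent condition merely zeroes the out-of-band terms.

#include <vector>

// Sketch of: LU(m1, m2) +=! (r_k >= m1 and r_k =< m2) ? L(m1, r_k) * U(r_k, m2) : 0
void triangular_matmul(const std::vector<std::vector<float>>& L,   // M x K
                       const std::vector<std::vector<float>>& U,   // K x M
                       std::vector<std::vector<float>>& LU) {      // M x M, preallocated
    const size_t M = L.size();
    const size_t K = M > 0 ? L[0].size() : 0;
    for (size_t m1 = 0; m1 < M; ++m1) {
        for (size_t m2 = 0; m2 < M; ++m2) {
            float acc = 0.0f;                       // `+=!` zero-initializes the reduction
            for (size_t r_k = 0; r_k < K; ++r_k) {  // full rectangular range, not m1..m2
                acc += (r_k >= m1 && r_k <= m2) ? L[m1][r_k] * U[r_k][m2] : 0.0f;
            }
            LU[m1][m2] = acc;
        }
    }
}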
@@ -110,9 +111,9 @@ and the emergence of an antidiagonal pattern in the reduction accesses:
 def matmul(float(M,K) A, float(K,N) B) -> (C) {
 C(m, n) +=! A(m, r_k) * B(r_k, n)
 }
-def matmul_bw(float(M,K) A, float(K,N) B, float(M,N) g_C) -> (g_A, g_B){
-g_A(m, k) +=! g_C( m, r_n) * B( k, r_n)
-g_B(k, n) +=! g_C(r_m, n) * A(r_m, k)
+def matmul_bw(float(M,K) A, float(K,N) B, float(M,N) d_C) -> (d_A, d_B){
+d_A(m, k) +=! d_C( m, r_n) * B( k, r_n)
+d_B(k, n) +=! d_C(r_m, n) * A(r_m, k)
 }
 
 Reasoning on such reduction patterns at the level of TC has already proven
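As a quick check of the backward definition above (a standard chain-rule derivation, not part of this commit): writing d_X for the gradient of a scalar loss with respect to X, the forward definition

C(m, n) = \sum_{r_k} A(m, r_k) \, B(r_k, n)

gives

d_A(m, k) = \sum_{r_n} d_C(m, r_n) \, B(k, r_n), \qquad d_B(k, n) = \sum_{r_m} d_C(r_m, n) \, A(r_m, k),

which is exactly the `d_A` and `d_B` updates written in TC, with `r_n` and `r_m` carrying the reductions.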

docs/source/framework/pytorch_integration/autograd_with_tc.rst

Lines changed: 14 additions & 14 deletions
@@ -29,9 +29,9 @@ Examples
 def convolution(float(N,C,H,W) I, float(M,C,KH,KW) W1) -> (O) {{
 O(n, m, h, w) +=! I(n, r_c, {sh} * h + r_kh, {sw} * w + r_kw) * W1(m, r_c, r_kh, r_kw)
 }}
-def convolution_grad(float(N,C,H,W) I, float(M,C,KH,KW) W1, float(N,M,H,W) g_O) -> (g_I, g_W1) {{
-g_I(n, c, h, w) +=! g_O( n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
-g_W1(m, c, kh, kw) +=! g_O(r_n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
+def convolution_grad(float(N,C,H,W) I, float(M,C,KH,KW) W1, float(N,M,H,W) d_O) -> (d_I, d_W1) {{
+d_I(n, c, h, w) +=! d_O( n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
+d_W1(m, c, kh, kw) +=! d_O(r_n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
 }}
 """
 N, C, H, W, O, kH, kW, sH, sW = 32, 4, 56, 56, 16, 1, 1, 1, 1
@@ -68,9 +68,9 @@ them, the example for that would be:
 def convolution(float(N,C,H,W) I, float(M,C,KH,KW) W1) -> (O) {{
 O(n, m, h, w) +=! I(n, r_c, {sh} * h + r_kh, {sw} * w + r_kw) * W1(m, r_c, r_kh, r_kw)
 }}
-def convolution_grad(float(N,C,H,W) I, float(M,C,KH,KW) W1, float(N,M,H,W) g_O) -> (g_I, g_W1) {{
-g_I(n, c, h, w) +=! g_O( n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
-g_W1(m, c, kh, kw) +=! g_O(r_n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
+def convolution_grad(float(N,C,H,W) I, float(M,C,KH,KW) W1, float(N,M,H,W) d_O) -> (d_I, d_W1) {{
+d_I(n, c, h, w) +=! d_O( n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
+d_W1(m, c, kh, kw) +=! d_O(r_n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
 }}
 """
 N, C, H, W, O, kH, kW, sH, sW = 32, 4, 56, 56, 16, 1, 1, 1, 1
@@ -102,9 +102,9 @@ Let's see how to cache options to file when we tune a training layer.
 def convolution(float(N,C,H,W) I, float(M,C,KH,KW) W1) -> (O) {{
 O(n, m, h, w) +=! I(n, r_c, {sh} * h + r_kh, {sw} * w + r_kw) * W1(m, r_c, r_kh, r_kw)
 }}
-def convolution_grad(float(N,C,H,W) I, float(M,C,KH,KW) W1, float(N,M,H,W) g_O) -> (g_I, g_W1) {{
-g_I(n, c, h, w) +=! g_O( n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
-g_W1(m, c, kh, kw) +=! g_O(r_n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
+def convolution_grad(float(N,C,H,W) I, float(M,C,KH,KW) W1, float(N,M,H,W) d_O) -> (d_I, d_W1) {{
+d_I(n, c, h, w) +=! d_O( n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
+d_W1(m, c, kh, kw) +=! d_O(r_n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
 }}
 """
 N, C, H, W, O, kH, kW, sH, sW = 32, 4, 56, 56, 16, 1, 1, 1, 1
@@ -136,11 +136,11 @@ the example below for how to use it:
 tmp(n, m, h, w) +=! I(n, r_c, h + r_kh, w + r_kw) * W1(m, r_c, r_kh, r_kw)
 O(n, m, h, w) = tmp(n, m, h, w) + B(m)
 }
-def convolution_grad(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(M) B, float(N, M, H, W) g_O)
--> (g_I, g_W1, g_B) {
-g_I(n, c, h, w) +=! g_O( n, r_m, h - r_kh, w - r_kw) * W1(r_m, c, r_kh, r_kw)
-g_W1(m, c, kh, kw) +=! g_O(r_n, m, r_h - kh, r_w - kw) * I(r_n, c, r_h, r_w)
-g_B(m) +=! g_O(n, m, h, w)
+def convolution_grad(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(M) B, float(N, M, H, W) d_O)
+-> (d_I, d_W1, d_B) {
+d_I(n, c, h, w) +=! d_O( n, r_m, h - r_kh, w - r_kw) * W1(r_m, c, r_kh, r_kw)
+d_W1(m, c, kh, kw) +=! d_O(r_n, m, r_h - kh, r_w - kw) * I(r_n, c, r_h, r_w)
+d_B(m) +=! d_O(n, m, h, w)
 }
 """
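As a sanity check on the `convolution_grad` definitions above (a standard stride-1 derivation, not part of this commit): with d_X denoting the gradient of a scalar loss with respect to X, the forward definition

O(n, m, h, w) = \sum_{r_c, r_kh, r_kw} I(n, r_c, h + r_kh, w + r_kw) \, W1(m, r_c, r_kh, r_kw)

yields

d_I(n, c, h, w) = \sum_{r_m, r_kh, r_kw} d_O(n, r_m, h - r_kh, w - r_kw) \, W1(r_m, c, r_kh, r_kw)

d_W1(m, c, kh, kw) = \sum_{r_n, r_h, r_w} d_O(r_n, m, r_h - kh, r_w - kw) \, I(r_n, c, r_h, r_w)

which matches the TC above for sh = sw = 1; the {sh} and {sw} placeholders generalize the index arithmetic to strided convolution.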

docs/source/framework/pytorch_integration/layers_database.rst

Lines changed: 39 additions & 39 deletions
@@ -32,8 +32,8 @@ Average pooling
 
 .. code::
 
-def avgpool(float(B, C, H, W) input) -> (output) {{
-output(b, c, h, w) +=! input(b, c, h * {sH} + r_kh, w * {sW} + r_kw) / ({kH} * {kW})
+def avgpool(float(B, C, H, W) Input) -> (Output) {{
+Output(b, c, h, w) +=! Input(b, c, h * {sH} + r_kh, w * {sW} + r_kw) / ({kH} * {kW})
 where r_kh in 0:{kH}, r_kw in 0:{kW}
 }}
 
@@ -43,8 +43,8 @@ Max pooling
 
 .. code::
 
-def maxpool(float(B, C, H, W) input) -> (output) {{
-output(b, c, h, w) max=! input(b, c, h * {sH} + r_kh, w * {sW} + r_kw)
+def maxpool(float(B, C, H, W) Input) -> (Output) {{
+Output(b, c, h, w) max=! Input(b, c, h * {sH} + r_kh, w * {sW} + r_kw)
 where r_kh in 0:{kH}, r_kw in 0:{kW}
 }}
 
@@ -76,9 +76,9 @@ Strided Convolution Gradient
 
 .. code::
 
-def convolution_grad(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(N, M, H, W) g_O) -> (g_I, g_W1) {{
-g_I(n, c, h, w) +=! g_O(n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
-g_W1(m, c, kh, kw) +=! g_O(n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
+def convolution_grad(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(N, M, H, W) d_O) -> (d_I, d_W1) {{
+d_I(n, c, h, w) +=! d_O(n, r_m, {sh} * h - r_kh, {sw} * w - r_kw) * W1(r_m, c, r_kh, r_kw)
+d_W1(m, c, kh, kw) +=! d_O(n, m, {sh} * r_h - kh, {sw} * r_w - kw) * I(r_n, c, r_h, r_w)
 }}
 
 Simple Group Convolution
@@ -140,11 +140,11 @@ Softmax
 
 .. code::
 
-def softmax(float(N, D) I) -> (O, maxVal, expDistance, expSum) {
-maxVal(n) max=! I(n, d)
-expDistance(n, d) = exp(I(n, d) - maxVal(n))
-expSum(n) +=! expDistance(n, d)
-O(n, d) = expDistance(n, d) / expSum(n)
+def softmax(float(N, D) I) -> (O, MaxVal, ExpDistance, ExpSum) {
+MaxVal(n) max=! I(n, d)
+ExpDistance(n, d) = exp(I(n, d) - MaxVal(n))
+ExpSum(n) +=! ExpDistance(n, d)
+O(n, d) = ExpDistance(n, d) / ExpSum(n)
 }
 
 Tanh
@@ -191,9 +191,9 @@ Matmul Gradient
 
 .. code::
 
-def matmul_bw(float(M,K) A, float(K,N) B, float(M,N) g_C) -> (g_A, g_B){
-g_A(m, k) +=! g_C( m, r_n) * B( k, r_n)
-g_B(k, n) +=! g_C(r_m, n) * A(r_m, k)
+def matmul_bw(float(M,K) A, float(K,N) B, float(M,N) d_C) -> (d_A, d_B){
+d_A(m, k) +=! d_C( m, r_n) * B( k, r_n)
+d_B(k, n) +=! d_C(r_m, n) * A(r_m, k)
 }
 
 Batch Matmul
@@ -219,8 +219,8 @@ Add
 
 .. code::
 
-def add(float(N) A, float(N) B) -> (output) {
-output(n) = A(n) + B(n)
+def add(float(N) A, float(N) B) -> (Output) {
+Output(n) = A(n) + B(n)
 }
 
 Tensor Operations
@@ -231,8 +231,8 @@ Indexing
 
 .. code::
 
-def indexing(float(H, W) input, int32(L) index) -> (output) {{
-output(l, w) = input(index(l), w)
+def indexing(float(H, W) Input, int32(L) Index) -> (Output) {{
+Output(l, w) = Input(Index(l), w)
 }}
 
 Lookup Table
@@ -327,17 +327,17 @@ Batch Normalization
 
 .. code::
 
-def batchnorm(float(N,C,H,W) I, float(C) rMeanIn, float(C) rVarIn)
--> (O, rMeanOut, rVarOut, mean, centered, variance, expectedVariance, normalizedOut)
+def batchnorm(float(N,C,H,W) I, float(C) RMeanIn, float(C) RVarIn)
+-> (O, RMeanOut, RVarOut, Mean, Centered, Variance, ExpectedVariance, normalizedOut)
 {{
-mean(c) +=! I(nn, c, hh, ww)
-mean(c) = mean(c) / (N * H * W)
-rMeanOut(c) = (1 - {momentum}) * rMeanIn(c) + {momentum} * mean(c)
-centered(n, c, h, w) = I(n, c, h, w) - rMeanOut(c)
-variance(n, c, h, w) = centered(n, c, h, w) * centered(n, c, h, w)
-expectedVariance(c) +=! (variance(n, c, h, w) + {eps}) / (N * H * W)
-rVarOut(c) = rsqrt((1 - {momentum}) * rVarIn(c) + {momentum} * expectedVariance(c))
-O(n, c, h, w) = centered(n, c, h, w) * rVarOut(c)
+Mean(c) +=! I(nn, c, hh, ww)
+Mean(c) = Mean(c) / (N * H * W)
+RMeanOut(c) = (1 - {momentum}) * RMeanIn(c) + {momentum} * Mean(c)
+Centered(n, c, h, w) = I(n, c, h, w) - RMeanOut(c)
+Variance(n, c, h, w) = Centered(n, c, h, w) * Centered(n, c, h, w)
+ExpectedVariance(c) +=! (Variance(n, c, h, w) + {eps}) / (N * H * W)
+RVarOut(c) = rsqrt((1 - {momentum}) * RVarIn(c) + {momentum} * ExpectedVariance(c))
+O(n, c, h, w) = Centered(n, c, h, w) * RVarOut(c)
 normalizedOut(n, c, h, w) = O(n, c, h, w)
 }}
 
@@ -346,12 +346,12 @@ Layer Normalization
 
 .. code::
 
-def layernorm(float(T, B, C) I) -> (O, mean, centered, var) {{
-mean(t, b) +=! I(t, b, c) / C
-centered(t, b, c) = I(t, b, c) - mean(t, b)
-var(t, b) +=! centered(t, b, c) * centered(t, b, c)
-var(t, b) = (var(t, b) + {eps}) / C
-O(t, b, c) = centered(t, b, c) / rsqrt(var(t, b))
+def layernorm(float(T, B, C) I) -> (O, Mean, Centered, Var) {{
+Mean(t, b) +=! I(t, b, c) / C
+Centered(t, b, c) = I(t, b, c) - Mean(t, b)
+Var(t, b) +=! Centered(t, b, c) * Centered(t, b, c)
+Var(t, b) = (Var(t, b) + {eps}) / C
+O(t, b, c) = Centered(t, b, c) / rsqrt(Var(t, b))
 }}
 
 Distance Functions
@@ -362,10 +362,10 @@ Cosine Similarity
 
 .. code::
 
-def cosine_similarity(float(M, N) I1, float(M, N) I2) -> (O, sumI1, sumI2) {{
-sumI1(m) +=! I1(m, n) * I1(m, n)
-sumI2(m) +=! I2(m, n) * I2(m, n)
-O(m) +=! (I1(m, n) * I2(m, n)) / fmax(rsqrt(sumI1(m)) * sqrt(sumI2(m)), {eps})
+def cosine_similarity(float(M, N) I1, float(M, N) I2) -> (O, SumI1, SumI2) {{
+SumI1(m) +=! I1(m, n) * I1(m, n)
+SumI2(m) +=! I2(m, n) * I2(m, n)
+O(m) +=! (I1(m, n) * I2(m, n)) / fmax(rsqrt(SumI1(m)) * sqrt(SumI2(m)), {eps})
 }}
 
 What operations can not be expressed
