I translate from one language to another by first encoding the source sentence and then decoding the resulting sentence embedding into the target language. Rather than using recurrent networks, I use transformers for both the encoder and the decoder. A transformer has no recurrence; instead, the position of each input symbol in the sequence is encoded explicitly, and each layer can self-attend over all of the inputs to that layer. Not only does this give better performance than a recurrent network, it also allows every input position to be encoded in parallel. At inference time the decoder must generate its output word by word, so that part remains inherently sequential. During training, however, the correct target sentence is already known, so the decoder can also be computed in parallel. This makes it possible to train models quickly on very large datasets by utilising multiple processing units.
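The two ingredients described above, explicit position codes and self-attention over all of a layer's inputs, can be sketched in a few lines of NumPy. This is an illustrative toy rather than the actual model: it uses the standard sinusoidal positional encoding and a single unmasked attention head, and the embeddings and weight matrices are random placeholders.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Explicit position codes: sinusoids of different frequencies,
    # one row per position, following the standard transformer recipe.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

def self_attention(X, Wq, Wk, Wv):
    # One unmasked attention head: every position attends over all the
    # inputs to the layer, so the whole sequence is processed at once.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: random token embeddings plus position codes, random weights.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Because nothing in `self_attention` depends on processing positions one after another, all five positions are computed in a single pair of matrix multiplications, which is exactly what lets the encoder (and, during training, the decoder) run in parallel.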
As with my speech recogniser, in practice I allow the decoder to generate multiple alternatives and I rescore the candidates using a number of heuristics before choosing the best one (see chapter 9 of my book).
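To show how a decoder can propose several alternatives and then rescore them, here is a minimal beam-search sketch. The toy next-token model and the length-normalisation heuristic are illustrative assumptions of mine, not necessarily the heuristics the text refers to.

```python
import math

def beam_search(next_logprobs, beam_width=3, max_len=3):
    # next_logprobs(prefix) maps each possible next token to its log-probability.
    # At every step, keep only the beam_width highest-scoring hypotheses.
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "</s>":      # finished hypothesis
                candidates.append((prefix, score))
                continue
            for tok, lp in next_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

def rescore(candidates, alpha=0.6):
    # One common rescoring heuristic: length-normalised log-probability,
    # so longer candidates are not penalised purely for being longer.
    return max(candidates, key=lambda c: c[1] / (len(c[0]) ** alpha))

# A toy (hypothetical) next-token model over three symbols.
def toy_model(prefix):
    return {"a": math.log(0.7), "</s>": math.log(0.2), "b": math.log(0.1)}

candidates = beam_search(toy_model, beam_width=3, max_len=3)
best = rescore(candidates)
```

The search keeps a small set of partial translations alive at once; rescoring then re-ranks the finished candidates with criteria the search itself did not use, which is the same two-stage pattern described above.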