Object Tracking using Convolutional and Recurrent Neural Network
1 online resource (52 pages) : PDF
University of North Carolina at Charlotte
In this master's thesis, a recurrent neural network based method for visual tracking in videos is introduced that learns to predict the bounding box location of a target object at every frame. Region information and distinctive visual features are obtained by applying a Convolutional Neural Network to each frame of the video. Our Recurrent Neural Network (LSTM) exploits this history of locations along with the high-level visual features learned by the deep neural network. In order to increase tracking accuracy and reduce computation cost, a novel approach is proposed to construct a larger LSTM network, which we call the Sparsely stacked LSTM (S2LSTM). The promise of S2LSTM is to offer a systematic solution for scaling LSTM networks to capture longer and more complex sequences compared to mainstream LSTM designs. S2LSTM is scalable and contains discrete, non-overlapping training stacks, offering a modular design for building complex LSTM networks. S2LSTM offers a discrete training mechanism that significantly helps grow network complexity without retraining the entire network. The key significance of S2LSTM is the addition of a time pooling module across stacked LSTM layers. It reduces the number of time steps propagating from the first LSTM to the second LSTM by filtering out the "intermediate outputs" across the stacked layers. In S2LSTM, the output of each LSTM stack is compared with its respective ground truth and trained as a separate paradigm. At the same time, it is less computationally intensive than a regular stacked LSTM. Our experiments on video data demonstrate that S2LSTM increases tracking overlap accuracy by 15% compared to the baseline ROLO implementation.
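To make the described idea concrete, the following is a minimal sketch (not the thesis code) of how time pooling between two LSTM stacks could look in PyTorch; the module and parameter names (S2LSTMSketch, pool_stride, head1, head2) and the feature dimensions are assumptions for illustration only.

    # Illustrative sketch of the time-pooling idea between stacked LSTMs.
    # All names and dimensions here are hypothetical, not the author's code.
    import torch
    import torch.nn as nn

    class S2LSTMSketch(nn.Module):
        def __init__(self, feat_dim=4096, hidden_dim=512, pool_stride=4):
            super().__init__()
            # First stack sees every frame's CNN feature plus the previous box.
            self.lstm1 = nn.LSTM(feat_dim + 4, hidden_dim, batch_first=True)
            self.head1 = nn.Linear(hidden_dim, 4)   # per-frame box from stack 1
            # Second stack only sees every pool_stride-th output of stack 1,
            # i.e. the "intermediate outputs" in between are filtered out.
            self.pool_stride = pool_stride
            self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.head2 = nn.Linear(hidden_dim, 4)   # box from stack 2

        def forward(self, cnn_feats, prev_boxes):
            # cnn_feats: (batch, T, feat_dim); prev_boxes: (batch, T, 4)
            x = torch.cat([cnn_feats, prev_boxes], dim=-1)
            h1, _ = self.lstm1(x)                    # (batch, T, hidden_dim)
            boxes1 = self.head1(h1)                  # compared with ground truth
            pooled = h1[:, ::self.pool_stride, :]    # time pooling: keep every k-th step
            h2, _ = self.lstm2(pooled)               # shorter sequence for stack 2
            boxes2 = self.head2(h2)                  # trained against its own targets
            return boxes1, boxes2

Because each stack has its own output head, the stacks can in principle be trained one at a time against ground-truth boxes, which is one way to read the "discrete non-overlapping training stacks" described above.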
Tabkhi, Dr. Hamed
Shin, Dr. Min
Lee, Dr. Minwoo
Shaikh, Dr. Samira
Thesis (M.S.)--University of North Carolina at Charlotte, 2018.
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). For additional information, see http://rightsstatements.org/page/InC/1.0/.
Copyright is held by the author unless otherwise indicated.