Determining the optimal treatment or dose level is an essential goal in personalized medicine. When many decision points are involved, the problem falls into the reinforcement learning setting, where stochastic policies are often considered. However, existing approaches, in both discrete and continuous action spaces, may assign risky treatments that lead to poor outcomes, especially when data are collected offline. It is therefore important to ensure safety and to control such behavior within the estimation procedure. We develop a novel quasi-optimal learning framework that can be easily estimated in off-policy settings with guaranteed performance. The key idea is to constrain the estimated actions to a subspace on which the Q-function is near-optimal. We evaluate our algorithm through comprehensive simulation experiments and a real-data application of dose suggestion to the Ohio Type 1 Diabetes dataset.
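As a rough sketch of this idea (the threshold $\epsilon$ and the set notation below are illustrative, not the paper's own definitions), the constrained action subspace at a state $s$ can be written as
\[
\mathcal{A}_\epsilon(s) \;=\; \Bigl\{\, a \in \mathcal{A} \;:\; Q^*(s, a) \,\ge\, \sup_{a' \in \mathcal{A}} Q^*(s, a') - \epsilon \,\Bigr\},
\]
so that the estimated policy places probability mass only on actions whose Q-values are within $\epsilon$ of the optimum, excluding risky, low-value treatments by construction.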