Etoki is the oral narration of a pictorial biography, a work that depicts a person's life in pictures and words. It has recently come to be performed with digital devices as ``digital Etoki.'' In digital Etoki, the narrator must manually set a region of interest (RoI) for each target scene, which takes time and effort. In this presentation, we report the results of a study on a method that supports this task by estimating the RoI for each scene in a pictorial biography. The method estimates the RoI using saliency-based person detection and distance-based clustering. Experiments indicated that the method is effective for specific scenes.
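
As a rough illustration of the distance-based clustering step, the following Python sketch groups already-detected person boxes and takes the bounding box of the largest group as the RoI. It is a minimal sketch under stated assumptions, not the method evaluated in the study: the saliency-based detector is assumed to have already produced the boxes, and the names `cluster_by_distance` and `estimate_roi`, the `max_dist` threshold, the single-linkage grouping rule, and the "largest cluster" selection rule are all hypothetical choices made for this example.

```python
from math import hypot

# A box is (x_min, y_min, x_max, y_max) in pixels.
# Boxes are assumed to come from a person detector that is not shown here.


def center(box):
    """Return the center point of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def cluster_by_distance(boxes, max_dist):
    """Group box indices by single linkage on center distance.

    Two boxes are linked when their centers are at most max_dist apart;
    a cluster is a set of boxes connected through such links.
    """
    unvisited = set(range(len(boxes)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = [seed], [seed]
        while frontier:
            cx, cy = center(boxes[frontier.pop()])
            near = []
            for j in unvisited:
                jx, jy = center(boxes[j])
                if hypot(cx - jx, cy - jy) <= max_dist:
                    near.append(j)
            unvisited.difference_update(near)
            group.extend(near)
            frontier.extend(near)
        clusters.append(sorted(group))
    return clusters


def estimate_roi(boxes, max_dist):
    """Return the bounding box of the largest cluster, or None if no boxes."""
    if not boxes:
        return None
    largest = max(cluster_by_distance(boxes, max_dist), key=len)
    members = [boxes[i] for i in largest]
    return (
        min(b[0] for b in members),
        min(b[1] for b in members),
        max(b[2] for b in members),
        max(b[3] for b in members),
    )


# Illustrative data only: three nearby figures and one distant figure.
boxes = [(10, 10, 30, 50), (35, 12, 55, 52), (60, 15, 80, 55), (300, 200, 320, 240)]
roi = estimate_roi(boxes, max_dist=40)
```

With these made-up boxes, the three nearby figures are chained into one cluster and the distant figure is left out, so the estimated RoI covers only the group. A narrator-facing tool would presumably let the user adjust the result, since the threshold strongly affects which figures are grouped.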