How can I judge in JavaScript whether spoken audio is similar to the correct answer?
For an English-learning app, I want to judge how similar the learner's spoken audio is to the correct answer.

The straightforward logic would be microphone audio → text → text comparison, but if possible I would rather not use a cloud service. I am wondering whether the judgment can be made by comparing the audio binary data directly in JavaScript.

Display of waveform similarity on the web

↑ I think this is one method, but if anyone knows of a good library or a decision algorithm, I would like to hear about it.

  • Answer # 1

    To begin with, there is little point in looking at the time-domain waveform. At a minimum, you need to apply an FFT and compare things like formants in the frequency domain.
    Beyond that, what language learning needs is a similarity measure that ignores differences in voice quality and judges whether the phonemes are correct, but I do not have a good idea for that.
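
As a rough illustration of why the frequency domain is the more useful place to compare, the sketch below computes a magnitude spectrum for one short frame and compares two frames by cosine similarity. The function names (`magnitudeSpectrum`, `cosineSimilarity`, `sineFrame`) are made up for this example, and the naive DFT only stands in for a real FFT. It does not extract formants or judge phonemes. It only shows that two frames with the same frequency content but a different phase, which look different as waveforms, end up with nearly identical spectra.

```javascript
// Naive DFT magnitude spectrum, O(n^2). Fine for a short frame;
// a real FFT implementation would be used for longer audio.
function magnitudeSpectrum(frame) {
  const n = frame.length;
  const half = Math.floor(n / 2);
  const mags = new Float64Array(half);
  for (let k = 0; k < half; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (2 * Math.PI * k * t) / n;
      re += frame[t] * Math.cos(angle);
      im -= frame[t] * Math.sin(angle);
    }
    mags[k] = Math.hypot(re, im);
  }
  return mags;
}

// Cosine similarity of two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical test signal: a sine that completes `cycles` periods in n samples.
function sineFrame(cycles, n, phase = 0) {
  return Array.from({ length: n }, (_, t) =>
    Math.sin((2 * Math.PI * cycles * t) / n + phase)
  );
}

const samePitchShifted = cosineSimilarity(
  magnitudeSpectrum(sineFrame(4, 64, 0)),
  magnitudeSpectrum(sineFrame(4, 64, 1.0))
);
const differentPitch = cosineSimilarity(
  magnitudeSpectrum(sineFrame(4, 64, 0)),
  magnitudeSpectrum(sineFrame(10, 64, 0))
);
console.log({ samePitchShifted, differentPitch });
```

For real speech, the same comparison would be run frame by frame, and the raw spectrum would probably be replaced by features that are less sensitive to the speaker's voice.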

  • Answer # 2

    How will you prepare the reference audio with the correct pronunciation? (I feel like I have seen it somewhere.)

    How do you compare voices with different voice qualities (≒ voices of different people)? (Use an FFT instead of the waveform and look at how the frequency components change?)

    How do you compare audio with different speaking speeds? (It may be necessary to normalize the speed and volume)
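
One common way to handle the speed and volume questions is to scale both signals to the same RMS level and then align them with dynamic time warping (DTW), which lets one sequence stretch in time against the other. The sketch below is only an illustration under that assumption; `normalizeRms` and `dtwDistance` are names invented here, and it works on plain number sequences. For real speech, the sequences would more likely be per-frame features than raw samples.

```javascript
// Scale a sequence so that its RMS level is 1 (removes overall volume differences).
function normalizeRms(samples) {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  if (rms === 0) return Array.from(samples);
  return Array.from(samples, (s) => s / rms);
}

// Classic DTW: minimum accumulated |a[i] - b[j]| over all monotonic alignments.
function dtwDistance(a, b) {
  const n = a.length;
  const m = b.length;
  // cost[i][j] = best cost of aligning the first i items of a with the first j of b
  const cost = Array.from({ length: n + 1 }, () =>
    new Array(m + 1).fill(Infinity)
  );
  cost[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const d = Math.abs(a[i - 1] - b[j - 1]);
      cost[i][j] =
        d + Math.min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]);
    }
  }
  return cost[n][m];
}

// Same shape, spoken "twice as slowly" and three times louder.
const reference = [0, 1, 2, 1, 0];
const slowAndLoud = [0, 0, 3, 3, 6, 6, 3, 3, 0, 0];
const differentShape = [2, 0, 2, 0, 2];

console.log(dtwDistance(normalizeRms(reference), normalizeRms(slowAndLoud)));
console.log(dtwDistance(normalizeRms(reference), normalizeRms(differentShape)));
```

The first distance comes out at (or extremely close to) zero because the two sequences differ only in speed and volume, while the second is clearly larger.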

    If you take the waveform-based approach, I think you will need to study a lot of background material, especially at the start.

    I think it is better to look for a speech recognition API (in Chrome it requires being online) or a speech recognition library that works offline, and to judge the result as text.

    The following may be used

  • Answer # 3

    This answer suggests wavesurfer-js (GitHub) as a candidate.


    Display of waveform similarity on the web
      ↑ I think this is one way

    The asker of the linked question also posted the URL, but I will quote the Condition field here.


    There is a limit in audio quality due to "processing on the browser"

    Why don't you try to run it on a browser?

    Using the examples that output to a Canvas, you could

    compare the two canvases generated from the sample audio and the recording,

    processing them by the pixel positions (time x height) that show the waveform.


    Would that make it possible to quantify whether they are roughly similar?
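
The pixel comparison described above can be approximated without reading a canvas back at all, because each pixel column of a waveform drawing essentially shows the peak amplitude of one time slice. The sketch below, with the made-up helpers `peakEnvelope` and `envelopeSimilarity`, reduces each recording to one peak per column and correlates the two column lists. It is not part of wavesurfer-js, and it inherits the weakness of any waveform-shape comparison: it says nothing about which phonemes were spoken.

```javascript
// Reduce samples to one peak value per "pixel column" (time slice).
function peakEnvelope(samples, columns) {
  const envelope = new Array(columns).fill(0);
  for (let c = 0; c < columns; c++) {
    const start = Math.floor((c * samples.length) / columns);
    const end = Math.floor(((c + 1) * samples.length) / columns);
    for (let i = start; i < end; i++) {
      envelope[c] = Math.max(envelope[c], Math.abs(samples[i]));
    }
  }
  return envelope;
}

// Pearson correlation of two equal-length envelopes, in the range -1..1.
function envelopeSimilarity(a, b) {
  const n = a.length;
  const meanA = a.reduce((sum, v) => sum + v, 0) / n;
  const meanB = b.reduce((sum, v) => sum + v, 0) / n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    cov += (a[i] - meanA) * (b[i] - meanB);
    varA += (a[i] - meanA) ** 2;
    varB += (b[i] - meanB) ** 2;
  }
  if (varA === 0 || varB === 0) return 0;
  return cov / Math.sqrt(varA * varB);
}

// Hypothetical recordings: a sound that gets louder over time, at two
// different lengths and volumes, and one that gets quieter instead.
const rising = Array.from({ length: 1000 }, (_, i) =>
  (i / 1000) * (i % 2 ? 1 : -1)
);
const risingQuietLong = Array.from({ length: 2000 }, (_, i) =>
  0.5 * (i / 2000) * (i % 2 ? 1 : -1)
);
const falling = rising.slice().reverse();

const columns = 10;
console.log(
  envelopeSimilarity(peakEnvelope(rising, columns), peakEnvelope(risingQuietLong, columns)),
  envelopeSimilarity(peakEnvelope(rising, columns), peakEnvelope(falling, columns))
);
```

Using the same number of columns for both recordings acts as a crude length normalization, and the correlation ignores overall volume.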

    It also seems possible to compare the Web Audio ArrayBuffer data, but if you do not know audio formats well, it will take time to investigate.
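
On the audio format concern: in a browser, the Web Audio API can decode an encoded file held in an `ArrayBuffer` into plain floating-point samples, so the container format does not have to be parsed by hand. The sketch below assumes that route. `mixToMono` and `decodeToMono` are names invented for this example; `decodeAudioData`, `getChannelData`, `numberOfChannels` and `length` are the standard Web Audio members. The usage at the bottom runs against a small stand-in object so it can be tried outside a browser.

```javascript
// Average all channels of an AudioBuffer-like object into one Float32Array.
function mixToMono(audioBuffer) {
  const mono = new Float32Array(audioBuffer.length);
  const channels = audioBuffer.numberOfChannels;
  for (let ch = 0; ch < channels; ch++) {
    const data = audioBuffer.getChannelData(ch);
    for (let i = 0; i < data.length; i++) {
      mono[i] += data[i] / channels;
    }
  }
  return mono;
}

// Browser-side use: decode an encoded file (ArrayBuffer) and mix it down.
// `audioContext` is expected to be an AudioContext created by the caller.
async function decodeToMono(audioContext, arrayBuffer) {
  const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
  return mixToMono(audioBuffer);
}

// Stand-in for a decoded stereo AudioBuffer, for illustration only.
const fakeStereoBuffer = {
  length: 3,
  numberOfChannels: 2,
  getChannelData(ch) {
    return ch === 0
      ? new Float32Array([1, 0, -1])
      : new Float32Array([0, 0, 1]);
  },
};
console.log(mixToMono(fakeStereoBuffer));
```

Once both the sample audio and the recording are available as mono sample arrays, any of the comparison ideas in this thread can be applied to them.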


    Mic voice → Text → Compare text

    The so-called "transcription" approach seems to require even more extensive know-how.
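
The hard part of that route is the recognition itself. Once some recognizer has produced a transcript, the text comparison step is small. The sketch below assumes the transcript is already available as a string and scores it against the expected sentence with a word-level edit distance. The helpers `normalizeText`, `levenshtein` and `textSimilarity` are invented for this example and are not tied to any particular speech recognition API.

```javascript
// Lowercase, drop punctuation, and split into words.
function normalizeText(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9' ]+/g, " ")
    .trim()
    .split(/\s+/)
    .filter(Boolean);
}

// Levenshtein edit distance between two sequences (arrays or strings).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      curr[j] = Math.min(substitution, prev[j] + 1, curr[j - 1] + 1);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Score from 0 (nothing matches) to 1 (same words in the same order).
function textSimilarity(expected, recognized) {
  const expectedWords = normalizeText(expected);
  const recognizedWords = normalizeText(recognized);
  const longest = Math.max(expectedWords.length, recognizedWords.length);
  if (longest === 0) return 1;
  return 1 - levenshtein(expectedWords, recognizedWords) / longest;
}

console.log(textSimilarity("How are you?", "how are you"));
console.log(textSimilarity("I like apples", "I like apple"));
```

A word-level score like this only reflects what the recognizer heard, so it measures intelligibility rather than how close the pronunciation is to the reference.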