A Simple Flesch-Kincaid Calculator for Markdown

---

This is a quick post inspired by a situation someone posted about on r/TechnicalWriting: a technical writer working as a contractor needed to evaluate the readability of their content, but strict security and confidentiality requirements limited the tools they could acquire.

At first glance, assessing how hard a text is to read is a matter of semantics: what ideas the words convey and how the sentences work together to convey meaning. But, well, in English, syntactic features (how the units of words and sentences are constructed) aren't a bad proxy for estimating the overall difficulty of a text.

While there are tools like Hemingway out there, I also believe that it's pretty empowering to know a bit of Python so you can build your own library of scripts and tools for working with files on your own.

Flesch-Kincaid readability tests

The Flesch-Kincaid readability tests are a pair of reading-difficulty measures that rely on two inputs: average sentence length (ASL) and average syllables per word (ASW). The two tests are the reading ease test and the grade level score. An interesting bit of history: these emerged as a standard for the US military and spread into industry from there. Eventually, the scores were built into Grammarly, MS Word, and other tools. With this proxy, we can easily implement the calculator.

We can start by noting the two formulae that we'll use. The reading ease score is calculated with the following formula:

Score = 206.835 - (1.015 * ASL) - (84.6 * ASW)  

In this case, a higher score is more readable.
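As a rough yardstick, scores between 60 and 70 correspond to plain English that 13- to 15-year-old students can readily understand.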

The grade level score is calculated with the following formula:

Grade level = (0.39 * ASL) + (11.8 * ASW) - 15.59

The value is roughly the US school grade level that would be expected to comprehend the passage. Since this is a mechanical test, it's actually baked into law: insurance contracts in many states are required to be written at a specific grade level.
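
As a quick worked example, suppose a passage averages 15 words per sentence (ASL = 15) and 1.5 syllables per word (ASW = 1.5). The reading ease score is 206.835 - (1.015 * 15) - (84.6 * 1.5) = 206.835 - 15.225 - 126.9 ≈ 64.7, and the grade level is (0.39 * 15) + (11.8 * 1.5) - 15.59 = 5.85 + 17.7 - 15.59 ≈ 8, so roughly an eighth-grade reading level.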

There are some key assumptions here: average word and sentence length don't necessarily indicate semantic complexity, and the tests completely ignore elements like sentence structure. But they give us a handy first pass. Let's turn this into a Python script that can take a raw Markdown file and return the corresponding scores.

Planning

A Markdown file contains simple syntax to indicate formatting, and since I use Markdown primarily with static site generators and Obsidian, the script needs to handle their conventions too. The overall structure will be:

  1. Take a Markdown file as an argument and store the text in a string.
  2. Use regular expressions and the substitute function to strip any formatting elements.
  3. Parse the text to build lists of sentences and words in the text.
  4. Loop through the list of words and count the total number of syllables.
  5. Perform and return the calculations.

When implementing this, you'll want to think about the kinds of source documents you're parsing. Both Hugo and Obsidian include front matter specified as YAML or TOML; since I don't want the front matter contributing to the score of the text, I'll need to handle it in addition to formatting and other elements of Markdown.
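
For reference, Hugo's TOML front matter is delimited by +++ lines, while YAML front matter (used by Obsidian and many Hugo sites) is delimited by --- lines. An illustrative example, not from a real post:

+++
title = "A Simple Flesch-Kincaid Calculator"
date = 2024-01-15
draft = true
+++

None of that metadata is prose, so it shouldn't reach the readability calculation.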

Implementation

I started with a simple function that reads the file and hands the text off for scoring. When a user runs the script, they'll be prompted to specify the path to the file:

import re  # used by the text-cleaning step below

def calculate_markdown_flesch_kincaid(markdown_file_path):
    """Calculates Flesch-Kincaid scores from a Markdown file."""
    try:
        with open(markdown_file_path, 'r', encoding='utf-8') as file:
            markdown_content = file.read()
            return calculate_flesch_kincaid(markdown_content)
    except FileNotFoundError:
        return "File not found.", None
    except Exception as e:
        return f"An error occurred: {e}", None

if __name__ == "__main__":
    markdown_file = input("Enter the path to your Markdown file: ")
    score, grade = calculate_markdown_flesch_kincaid(markdown_file)
    if grade is not None:
        print(f"Flesch-Kincaid Score: {score:.2f}")
        print(f"Equivalent Grade Level: {grade}")
    else:
        print(score)  # score holds the error message in this case
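
Running the script looks something like this (the path and numbers here are just illustrative):

Enter the path to your Markdown file: posts/draft.md
Flesch-Kincaid Score: 65.73
Equivalent Grade Level: 8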

Now I just need the method to handle cleaning the file and calculating the scores — calculate_flesch_kincaid.

Cleaning the file

Honestly, I dislike writing regex. When I was learning Python, the courses on regex and string handling were the ones that struggled most to hold my attention. So I think this is something generative AI is genuinely useful for (I have complex feelings about generative AI, but I'm fine with this use). To handle the text conversion, I described the basic syntax of Markdown and prompted Gemini to produce regex that identifies key patterns of characters and removes them. This produced the following series of text = re.sub(r'expression', '', text) calls, which I cleaned up and reordered, to convert the file to simplified text for analysis.

I wanted to use the script with Hugo and Obsidian, so I added blocks to perform a similar operation for shortcodes and YAML/TOML front matter:

# Remove front matter (YAML or TOML) at the start of the file
text = re.sub(r'(^---.*?---)|(^\+\+\+.*?\+\+\+)', '', text, flags=re.DOTALL)

# Remove Hugo shortcodes
text = re.sub(r'\{\{<.*?>\}\}', '', text)
text = re.sub(r'\{\{%.*?%\}\}', '', text)

# Remove Markdown formatting
text = re.sub(r'```.*?```', '', text, flags=re.DOTALL)  # Remove code blocks (before inline code, so the fences don't pair up as empty inline spans)
text = re.sub(r'`.*?`', '', text)  # Remove inline code
text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)  # Remove heading markers
text = re.sub(r'!\[.*?\]\(.*?\)', '', text)  # Remove images (before links, since image syntax contains link syntax)
text = re.sub(r'\[.*?\]\(.*?\)', '', text)  # Remove links
text = re.sub(r'\*\*\*|\*\*|\*', '', text)  # Remove bold and italic markers
text = re.sub(r'---', '', text)  # Remove horizontal rules
text = re.sub(r'^>.*$', '', text, flags=re.MULTILINE)  # Remove blockquotes
text = re.sub(r'~~.*?~~', '', text)  # Remove strikethrough

text = re.sub(r'[^\w\s.?!,:;]', '', text)  # Remove special characters except punctuation
text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
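
To make the behavior concrete: a snippet like "## Heading" followed by "Some **bold** and *italic* text." comes out the other side as "Heading Some bold and italic text." Note that the link and image rules drop the whole element, anchor text included.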

A bit of an aside: the initial text-handling logic was pretty good, but the way it first handled the front matter introduced a bug where it returned null strings. Having a couple of test files as you go can pay off! My test file included a lorem ipsum block, and I could add sample blocks for any category that the cleaning needed to handle.

After substitution, I added a quick check to ensure that the text string isn't empty, stopping the script and returning an error message if it is. With the text processed, we need both the words and the sentences from the Markdown file, so I defined two lists: sentences is what you get by splitting on sentence-ending punctuation (., !, and ?), and words comes from the simple regex \b\w+\b.

sentences = re.split(r'[.!?]+', text)
sentences = [s.strip() for s in sentences if s.strip()]

words = re.findall(r'\b\w+\b', text.lower())
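
As a quick sanity check, running these two steps on the text "It works. Really, it does!" gives sentences of ['It works', 'Really, it does'] and words of ['it', 'works', 'really', 'it', 'does'].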

With these two lists in hand, we have everything we need to count syllables and calculate the scores.

Calculations

Accurately counting syllables in English is actually a pretty complicated, context-dependent task, but we can use spelling to make a quick and reasonable estimate. Grammarly recommends counting the total number of vowels, then subtracting silent vowels (e.g., vowels that occur in groups, the letter e at the end of a word, etc.). This is easy to set up with a pair of nested loops, going through each word and each character in the word, increasing the count only when a character is a vowel and the previous character wasn't. If the last character is an 'e', we'll subtract one from the final count.

Without any additional steps, this will slightly underestimate the number of syllables. Words ending with an "e" often don't introduce a new syllable (e.g., "ice" is monosyllabic), but sometimes the ending matters: "ale" is monosyllabic, while "simple" has two syllables. The rule of thumb is that when a consonant comes before a final "-le", the "-le" forms a syllable of its own (as in "simple" or "table"). So, after subtracting for the trailing "e", we'll check for that pattern (ends in "le", word length greater than three letters, and third-to-last character is not a vowel) and add one back. Words like "ukulele", where a pronounced vowel precedes the final "e", remain edge cases that this quick heuristic miscounts.

Here’s how the script handles counting syllables:

syllable_count = 0
vowels = "aeiouy"
for word in words:
    word_syllable_count = 0
    prev_char_was_vowel = False
    for char in word:
        if char in vowels:
            # Count a syllable only at the start of each vowel group
            if not prev_char_was_vowel:
                word_syllable_count += 1
            prev_char_was_vowel = True
        else:
            prev_char_was_vowel = False
    # Adjust for a (usually silent) trailing "e"
    if word.endswith("e"):
        word_syllable_count -= 1
    # Add a syllable back for consonant + "le" endings ("simple", "table")
    if word.endswith("le") and len(word) > 3 and word[-3] not in vowels:
        word_syllable_count += 1
    word_syllable_count = max(1, word_syllable_count)  # Ensure at least 1 syllable
    syllable_count += word_syllable_count
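
As a sanity check, take "readable": the loop finds three vowel groups ("ea", "a", and the final "e"), the trailing "e" rule subtracts one, and the consonant-plus-"le" rule adds one back, giving three syllables. For "ice", two vowel groups minus the silent "e" leaves one.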

Now that we have all the values for our formulas, it’s a simple matter of implementing and returning the results:

    num_sentences = len(sentences)
    num_words = len(words)

    if num_sentences == 0 or num_words == 0:
        return "Cannot calculate score: Insufficient text.", None

    average_sentence_length = num_words / num_sentences
    average_syllables_per_word = syllable_count / num_words

    flesch_kincaid_score = 206.835 - 1.015 * average_sentence_length - 84.6 * average_syllables_per_word
    grade_level = round(0.39 * average_sentence_length + 11.8 * average_syllables_per_word - 15.59)

    return flesch_kincaid_score, grade_level

Conclusion

With the pieces in place, it's pretty simple to use: run the script and enter the filepath of the Markdown file you'd like to analyze. It handles my blog posts with ease, and it isn't a bad heuristic for evaluating the readability of whatever piece of writing you're working on. This is a nice little addition to my drawer of helper scripts for writing projects. Not bad for a quick side project!