Optimizing Flashcard Study

Anki and Spaced Repetition

Learning new things is an important task in the modern world. When that learning is rote memorization, flashcards can be an efficient way to commit facts, some of them interrelated, to memory. Students of Spanish need to learn that “pollo” = “chicken” and vice versa, medical students need to memorize the names and locations of the bones in the foot, and lawyers may need to memorize important case law.

Within digital flashcard apps, a method has emerged to make learning more efficient: the Spaced Repetition System (SRS). Using SRS, an app shows you a flashcard only when you are likely to forget it. Generally, the time between showings of a card doubles after each successful recall and halves after a failure to recall. This is the Leitner System, an earlier non-digital method, and it is a common scheduling algorithm because it is easy to implement and far more efficient than manual, non-adaptive study of flashcards.
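The doubling-and-halving rule can be sketched in a few lines. This is only an illustration of the rule as described here, not the code of any particular app; the function name `next_interval` and the one-day minimum are assumptions.

```python
def next_interval(interval_days, recalled, minimum_days=1.0):
    """Double the interval after a success, halve it after a failure.

    The one-day floor is an assumption made for this sketch.
    """
    factor = 2.0 if recalled else 0.5
    return max(minimum_days, interval_days * factor)


# A made-up review history: three successes, one failure, one success.
interval = 1.0
for recalled in [True, True, True, False, True]:
    interval = next_interval(interval, recalled)
    print("recalled" if recalled else "forgot", "-> next interval:", interval, "days")
```

Starting from one day, this history walks the interval through 2, 4, 8, 4 and back to 8 days.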

We can take this method to the next level by more accurately estimating when we are about to forget a fact. In a simple treatment, our retention, or likelihood of remembering, is modeled as decaying exponentially over the time between reviews, called the interval. However, several things can complicate this simple picture, such as the study of similar facts or coming across the material in the wild, outside of a study session.

Forgetting Curve

Above is the likelihood of retention for a single vocabulary card over the course of a year. After each successful review (green up arrow), the expected retention jumps to 100%, then decays more and more slowly. However, the journey a piece of knowledge takes towards long-term memory can be long and unpredictable, and there are many failed reviews (red down arrows) where we quickly relearn the item. A full and accurate understanding of this history (for example, studying similar material is shown as a dashed line here) is necessary to fully optimize our studying schedule, to make sure that we are studying with the best efficiency possible.

With great efficiency, we can learn more while spending less time; with a better algorithm, we can eliminate extreme ups-and-downs of difficulty. With some more thought, we may even be able to choose the growth of long-term memory as our sole metric (accounting for diverse effects like burnout and the best difficulty level for one’s brain) in perhaps much the same way as a financier optimizes a stock portfolio.

First Strategy

Forgetting Curve

Memory only gradually decays for cards that have been successfully learned, while the likelihood of recall for unlearned information disappears over a much shorter timescale. The objective of studying with an SRS is to convert knowledge as efficiently as possible from short-term to long-term memory.

One efficient strategy suggests that we practice flashcards whenever they reach 90% retention likelihood (see the dotted line in the above figure). There can be additional complications to this simple plan (see the complicated top figure); sometimes facts will come up outside of direct studying, in the world or even when studying different subjects. It is impossible to perfectly account for these situations, but we can do our best to estimate likelihood of retention at all times (as in the above figure).
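Under the simple exponential-decay picture, the 90% rule turns into a formula for the next interval. The sketch below assumes the decay is described by a half-life; that parametrization, the function names, and the 30-day example card are all illustrative choices rather than anything taken from Anki.

```python
import math


def retention(elapsed_days, half_life_days):
    """Recall probability after elapsed_days, assuming pure exponential decay."""
    return 0.5 ** (elapsed_days / half_life_days)


def interval_for_target(half_life_days, target=0.9):
    """Days until retention() has fallen to `target`."""
    return half_life_days * math.log(target) / math.log(0.5)


# Hypothetical card whose memory halves every 30 days.
days = interval_for_target(30.0)
print(f"review after {days:.1f} days, retention then is {retention(days, 30.0):.2f}")
```

For this made-up card the review falls due after roughly 4.6 days, which is only about 15% of the half-life: a 90% target means reviewing well before the memory has mostly faded.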

This way our study session is not overly frustrating, but still taxes the brain. Within Anki, the most popular SRS, we cannot target this percentage explicitly; instead it allows the user to make small adjustments on the Leitner System scheduling algorithm, with average retention as one of the few metrics provided.

Example Dataset

With a large dataset of the review results from studying for several years, we can look at some unresolved questions concerning SRS study. Here we look at a dataset of about 120k repetitions of Chinese vocabulary, spanning 400 hours of study by a single person over 2 years.
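For readers who want to look at their own history, here is a minimal sketch of pulling review rows out of an Anki collection file with Python's built-in `sqlite3` module. It assumes the reviews live in a table called `revlog` with the columns named in the query, which matches my understanding of the `collection.anki2` format; other Anki versions may differ. The function name `load_reviews` is hypothetical.

```python
import sqlite3


def load_reviews(path):
    """Return all rows of the review log as a list of dicts, oldest first."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    try:
        rows = conn.execute(
            # Assumed meanings: id = review time in epoch milliseconds,
            # cid = card id, ease = button pressed (1 = failed),
            # ivl = new interval, time = milliseconds spent answering,
            # type = learning / review / relearning.
            "SELECT id, cid, ease, ivl, time, type FROM revlog ORDER BY id"
        ).fetchall()
    finally:
        conn.close()
    return [dict(row) for row in rows]
```

It is safest to run this on a copy of the file while Anki is closed.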

Histogram of intervals

After using an SRS for a while, there is a natural distribution of card learned-ness, reflected in the typical interval (or period of time) over which a card is forgotten. A histogram of these intervals shows a power-law distribution, with unlearned/small-interval cards much more common than learned/large-interval cards. It takes time and effort to move them down the pile, while new cards are being added every day at the top left.
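A distribution like this is easiest to read when the bins double in width. The following is a rough sketch of such a histogram, assuming intervals are whole numbers of days (sub-day learning steps are skipped); the sample list is made up and is not the dataset discussed here.

```python
from collections import Counter


def doubling_histogram(intervals_days):
    """Count integer intervals in bins 1, 2-3, 4-7, 8-15, ... days."""
    counts = Counter()
    for ivl in intervals_days:
        if ivl >= 1:  # skip sub-day learning steps
            counts[ivl.bit_length() - 1] += 1
    # Key each bin by its first day.
    return {2 ** k: counts[k] for k in sorted(counts)}


sample = [1, 1, 1, 2, 3, 3, 5, 9, 40]  # made-up intervals in days
for start, n in doubling_histogram(sample).items():
    print(f"{start:>3}-{2 * start - 1:<3} days: {'#' * n}")
```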

Many interesting questions concern studying very new cards and those that recently had failed reviews; it can be very difficult to estimate a new interval with 90% recall likelihood. For example, typical intervals after a failed recall, or a lapse, tend to be 10 or 30 minutes. It is very common for students to optimize these intervals by feel; if too short, it is very inefficient, while if too long, a card review may keep failing until the next day when it is sure to start a brand new discouraging day of failures.

Optimal interval for new cards

For my Chinese studies, the figure above suggests that after a lapse, and following the next time the card is successfully recalled, the appropriate interval is about 1 hour. This graph arose naturally because the Anki scheduler may have other priorities before it gets around to showing a lapsed card, or I could have just taken a break somewhere, leading to a spread in intervals that allows this calculation. Curiously, after successful recall here, the next interval should still be an hour. It may only truly begin to grow after starting a new day, after sleeping.
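One way a curve like this could be computed is to bin reviews by how long the card actually waited and take the success rate in each bin. The sketch below assumes each review has been reduced to a pair `(elapsed_minutes, recalled)`; the bin edges and sample data are invented for illustration.

```python
import bisect
from collections import defaultdict


def retention_by_bin(reviews, bin_edges):
    """Map each bin's lower edge to (number of reviews, success rate).

    reviews: iterable of (elapsed_minutes, recalled) pairs.
    bin_edges: sorted lower edges; values below the first edge are ignored.
    """
    totals = defaultdict(lambda: [0, 0])  # edge -> [reviews, successes]
    for elapsed, recalled in reviews:
        i = bisect.bisect_right(bin_edges, elapsed) - 1
        if i < 0:
            continue
        bucket = totals[bin_edges[i]]
        bucket[0] += 1
        bucket[1] += int(recalled)
    return {edge: (n, ok / n) for edge, (n, ok) in sorted(totals.items())}


# Made-up reviews: (minutes since the previous review, recalled?)
sample = [(5, True), (12, True), (15, False), (70, True), (90, False), (100, False)]
print(retention_by_bin(sample, [0, 10, 60]))
```

The longest bin whose success rate still sits near 90% is then a candidate for the interval, provided the bin holds enough reviews to be trusted.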

Review history graph for an entire card

We can also look at how often we successfully recall a flashcard after spending a given amount of time in the review test. The above figure shows that the largest proportion of my reviews are successful at 2.5 seconds. If we take longer, it could be a sign we had to think harder, and if shorter, we might have quickly recognized that we had forgotten a card.

Best Days and Hours to Study?

Can simple histograms show us the optimal day of the week and time to study? In this section we look at a couple of graphs of average retention rate for review cards, excluding newer and lapsed cards, which may be special.

Retention over time

This plot shows the average retention rate over about a year and a half. The Anki manual suggests that the program’s default settings will lead to an average of ~90%, while competitor SuperMemo actively optimizes for this 90% metric. Several choices can lead to a higher or lower number, at the expense of more or fewer daily reviews, or a different rate of card failure for long intervals.

Histogram of retention versus day of the week

This plot shows the average retention rate for cards for each day of the week. The graph is smooth, though the overall variation is small. Is it really better for me to study on Fridays?

Histogram of retention versus hour of the day

Finally, this plot shows the average retention for review cards at each hour of the day. There is a much larger variation here.
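Averages like these come from a simple group-by. The sketch below assumes each review is a pair `(epoch_ms, ease)`, with the timestamp in milliseconds since 1970 and an ease of 1 meaning a failed review; both are assumptions about the data layout. Timestamps are read as UTC to keep the example deterministic, whereas a real analysis would convert to the learner's local time.

```python
from collections import defaultdict
from datetime import datetime, timezone


def retention_by(reviews, key):
    """Average success rate per group, where key maps a datetime to a label."""
    totals = defaultdict(lambda: [0, 0])  # label -> [reviews, successes]
    for epoch_ms, ease in reviews:
        when = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
        bucket = totals[key(when)]
        bucket[0] += 1
        bucket[1] += int(ease > 1)  # assumption: ease 1 is a failed review
    return {label: ok / n for label, (n, ok) in sorted(totals.items())}


# Made-up reviews: Thursday 00:00, Thursday 01:00, Friday 00:00 (UTC).
sample = [(0, 3), (3_600_000, 1), (86_400_000, 3)]
print(retention_by(sample, lambda d: d.weekday()))  # 0 = Monday ... 6 = Sunday
print(retention_by(sample, lambda d: d.hour))
```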

It’s possible that there are hourly and weekly effects for me, but it’s unlikely that we’ve isolated true shifts in brain efficiency in this section’s first figure. What’s critically important to remember is that average retention is determined not just by your brain but also by the Anki algorithm, called SuperMemo2 or SM2. We can’t really tell if studying on Sundays is less efficient until we can more accurately estimate how innately difficult each card is.

The strongest component of review difficulty is how long it has been since the last review, but there are many subtle factors that can help improve this estimate, to build a better model. We cannot really answer questions about the best way to study until we have accounted for foreseeable complications.

Improving the SM2 Model

In fact, there is no clear way to evaluate the Anki algorithm, SM2. It does not officially target a mean retention rate of 90% or anything else. Over the past 2 years and 120k reviews used in the dataset above, there have been changes in my study patterns as well as tinkering with the algorithm scheduling parameters.

Can we learn from the natural variation in scheduling, as well as the stochastic nature of flashcard recall, to better schedule cards? If we succeed, we will be able to learn more information with the same number of reviews, as well as achieving other benefits.

To improve upon Anki’s algorithm, we can use a neural network to estimate p(t), the probability of retention at time t for a card. A simple Multilayer Perceptron (MLP) or Deep Neural Network (DNN) may give us quick results, while an LSTM-based Recurrent Neural Network (RNN) is probably the most appropriate solution for modeling the time series of flashcard review results.

Baseline and MLP

Scheduling algorithms suggest cards to study at what they consider the optimal time. As a baseline, let’s grant that whatever parameters were used for Anki’s SM2 scheduling in our dataset, the observed mean retention rate was the intended one. For me this is around 85%.

If we take this constant success rate as the model’s likelihood for all reviews and calculate the mean-squared error against 1’s and 0’s for successful and failed reviews, the total baseline error for the dataset is about MSE=15.5%.
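The baseline calculation is short enough to write out. The sketch below uses a made-up list of outcomes, not the real dataset. It also shows that a constant prediction equal to the mean success rate p always gives an MSE of p(1 - p), so the baseline depends only on the exact retention rate.

```python
def constant_baseline(outcomes):
    """Predict the mean success rate for every review and return (rate, MSE).

    outcomes: list of 1 (recalled) and 0 (forgotten).
    """
    p = sum(outcomes) / len(outcomes)
    mse = sum((y - p) ** 2 for y in outcomes) / len(outcomes)
    return p, mse


outcomes = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]  # made-up: 8 of 10 recalled
p, mse = constant_baseline(outcomes)
print(f"retention {p:.0%}, baseline MSE {mse:.1%}, p(1-p) = {p * (1 - p):.1%}")
```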

With a simple DNN, we achieve MSE=12.4%, a 20% improvement. More practically, with our model we can now adjust the interval assigned to each card until its predicted retention is closer to the 90% target. The model fits and predicts almost instantaneously.
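Adjusting an interval toward the target can be done with a plain bisection search over any model that predicts retention as a function of time. This is a sketch under the assumption that the prediction never increases as the interval grows; `predict` stands in for a trained model, and the exponential curve used in the example is only a placeholder.

```python
def schedule_interval(predict, target=0.9, low=0.0, high=3650.0, tol=1e-6):
    """Find the interval (in days) where predict(t) drops to `target`.

    predict: function mapping an interval in days to a recall probability,
             assumed to be non-increasing in t.
    """
    if predict(high) >= target:
        return high  # card stays above the target for the whole search range
    while high - low > tol:
        mid = (low + high) / 2
        if predict(mid) >= target:
            low = mid
        else:
            high = mid
    return low


# Placeholder model: memory that halves every 10 days.
days = schedule_interval(lambda t: 0.5 ** (t / 10))
print(f"schedule the next review in {days:.2f} days")
```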

Finally, we can reach MSE=11.1% with a modest 3-layer 16-unit LSTM model, a 28% improvement on the baseline, after a short training period.

This means that we can schedule cards 28% more accurately, although it’s not clear whether that translates 1-to-1 to number of reviews or time spent; we do not yet have a good measure for how much less time you’ll spend studying using this scheduler.
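The improvement percentages quoted above are relative reductions in MSE against the baseline, which can be checked directly from the three reported figures:

```python
def relative_improvement(baseline_mse, model_mse):
    """Fraction by which the model's MSE is lower than the baseline's."""
    return (baseline_mse - model_mse) / baseline_mse


baseline = 0.155  # constant-rate baseline reported above
for name, mse in [("MLP", 0.124), ("LSTM", 0.111)]:
    print(f"{name}: {relative_improvement(baseline, mse):.1%} lower MSE")
```

This gives 20.0% for the MLP and 28.4% for the LSTM.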

Siblings

This model importantly takes into account the effect of card “siblings” – different ways of presenting information about the same fact. For a foreign language like Spanish, this could mean being tested on both “chicken=pollo” and “pollo=chicken”, often called the front and reverse side as for a physical flashcard.

There can be additional siblings, however: listening and pronunciation cards, uses in a sentence, homonyms, etc., are all easy to implement in a flashcard app. Each time a card is reviewed, it can alter the likelihood of retention for all of its siblings, drastically reducing the efficiency of study with Anki’s SM2 scheduler early in the learning process.

Forgetting Curve

On the other hand, if we can account for these sibling reviews (shown as dotted vertical lines in the above figure, such as at the 50 day mark), there is much less of a limit on the number of modalities with which we can study a given fact. Approaching a fact in a number of ways is one strategy that has been hypothesized to supercharge efficiency and effectiveness of study.
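One simple way to expose sibling reviews to a model is to add, for every review, the time since any sibling of that card was last seen. The sketch below is an illustration of that idea rather than the exact feature used for the results above. It assumes reviews arrive as `(timestamp, note_id, card_id)` tuples sorted by time, where cards sharing a `note_id` are siblings.

```python
def add_sibling_gaps(reviews):
    """Append to each review the time since the latest sibling review.

    reviews: list of (timestamp, note_id, card_id), sorted by timestamp.
    The gap is None when no sibling has been reviewed yet.
    """
    last_seen = {}  # note_id -> {card_id: timestamp of its latest review}
    result = []
    for ts, note_id, card_id in reviews:
        cards = last_seen.setdefault(note_id, {})
        sibling_times = [t for cid, t in cards.items() if cid != card_id]
        gap = ts - max(sibling_times) if sibling_times else None
        result.append((ts, note_id, card_id, gap))
        cards[card_id] = ts
    return result


# Made-up history, timestamps in days.
history = [
    (0, "pollo", "front"),
    (5, "pollo", "reverse"),
    (9, "pollo", "front"),
    (10, "gato", "front"),
]
for row in add_sibling_gaps(history):
    print(row)
```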

Conclusions and Future Work

This blog post has laid out some initial analysis of a moderately sized dataset of flashcard reviews by a single person. With a little bit of care, we can start to choose better default parameters for the Anki scheduling algorithm.

This can smooth the variability in mental load of studying thousands of flashcards at the time we are statistically likely to start forgetting them. It can also increase our efficiency, by avoiding studying cards too early or too late. This means we can learn more material with the same amount of study time.

Using a better predictive model for p(t), the likelihood of card review success at time t, we can better schedule cards to target 90% retention.

But it is an open question whether this 90% target is optimal – perhaps the brain works more efficiently when it works harder, with an 80% target; on the other hand, medical student users of Anki often insist on high retentions near test time. Can we quantify the efficiency of these strategies?

This first investigation uses no advanced concepts regarding the stochastic, binomial process of flashcard review. More complicated statistical work is certainly possible, in addition to tweaks of the RNN structure and training.

Most important of all is the building of a larger dataset, which incorporates more than just language vocabulary flashcards, and draws from the experience of many people and their idiosyncratic study habits. Such data is the heart of any statistical model with greater predictive ability. By working together, we can start to answer historic questions about the efficiency and effectiveness of studying.

flashcardwizard.com

Towards making a better model and bringing it to the Anki masses, I’ve launched flashcardwizard.com, where you can upload your collection.anki2 review database. With aggregated data from many different people, we can make an even better model of the learning process.

The website will automatically generate this post’s graphs for your historical reviews. In the near future, I hope to offer live graphs as well as a synchronization with your database of the optimal intervals – you’ll never have to fiddle with Anki parameters again!

Please support this goal by going to the website and uploading your historical review database.