Counting Frequencies

The residents of Pythora often use Python for data processing. A common task is counting how many times each entry appears in the input data. There are many ways to implement this simple algorithm.

Dictionary

The most straightforward way is to use Python's built-in dictionary data type:

elements = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counts = {}

for element in elements:
    if element in counts:
        counts[element] += 1
    else:
        counts[element] = 1

print(counts)  # Output: {'apple': 3, 'banana': 2, 'orange': 1}

When we introduced the dictionary data type, we noted that its built-in methods can replace the manual check for whether an element is already in the dictionary. With setdefault(), which inserts a key with a default value only when the key is missing, the code becomes simpler:

elements = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counts = {}

for element in elements:
    counts.setdefault(element, 0)
    counts[element] += 1

print(counts)  # Output: {'apple': 3, 'banana': 2, 'orange': 1}

More concisely, get() returns the current count, or the default 0 for a missing key, in a single expression:

elements = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counts = {}

for element in elements:
    counts[element] = counts.get(element, 0) + 1

print(counts)  # Output: {'apple': 3, 'banana': 2, 'orange': 1}

defaultdict Class

Beyond standard dictionaries, Python's collections module provides specialized classes to simplify specific tasks. For example, to handle default values for missing keys, we can use the defaultdict class, which inherits from the built-in dict.

A defaultdict accepts a callable (a factory function) when initialized. When you look up a missing key with square brackets, the defaultdict calls this factory with no arguments, stores the result under that key, and returns it. For instance, since the built-in int() function returns 0 when called without arguments, passing int as the factory ensures that any missing key defaults to 0. Because the default value is produced by a function call rather than being a fixed value, a custom factory can return a different default on each call. The factory never receives the key, however, so the default cannot depend on the key itself. The following example hands out consecutive integers to new keys:

from collections import defaultdict
from itertools import count

# Create a counter iterator with itertools.count
counter = count()
# Each time a new key is seen, next(counter) supplies the next integer
inc_defaultdict = defaultdict(counter.__next__)

# Test:
print(inc_defaultdict["a"])  # Output: 0
print(inc_defaultdict["a"])  # Output: 0 (key already exists)
print(inc_defaultdict["b"])  # Output: 1
print(inc_defaultdict["c"])  # Output: 2

This makes defaultdict highly suitable for tasks where you need a default value for any missing key, such as when grouping or counting, eliminating the need to check for a key's existence beforehand.
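When a default really does need to depend on the key, one option is to subclass dict and define the __missing__ hook, which dict calls with the missing key. The sketch below is only an illustration: the class name KeyAwareDefault and the rule "default is the length of the key" are hypothetical choices, not part of the collections module.

```python
class KeyAwareDefault(dict):
    """Hypothetical dict subclass whose default depends on the missing key."""

    def __missing__(self, key):
        # dict.__getitem__ calls this hook when `key` is absent.
        value = len(key)   # illustrative rule: the default is the key's length
        self[key] = value  # store it, mirroring what defaultdict does
        return value


lengths = KeyAwareDefault()
print(lengths["apple"])  # 5
print(lengths["fig"])    # 3
print(dict(lengths))     # {'apple': 5, 'fig': 3}
```

As with defaultdict, only square-bracket lookups trigger the hook; get() and the in operator leave the dictionary unchanged.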

Here is an example of using defaultdict for frequency counting:

from collections import defaultdict

elements = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counts = defaultdict(int)

for element in elements:
    counts[element] += 1

print(dict(counts))  # Output: {'apple': 3, 'banana': 2, 'orange': 1}

You can also use defaultdict for grouping tasks, such as grouping surnames by their first letter:

from collections import defaultdict

surnames = ['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis']
names_by_first_letter = defaultdict(list)

for surname in surnames:
    first_letter = surname[0]
    names_by_first_letter[first_letter].append(surname)

print(dict(names_by_first_letter))
# Output: {'S': ['Smith'], 'J': ['Johnson', 'Jones'], 'W': ['Williams'], 'B': ['Brown'], 'D': ['Davis']}

Counter Class

Python's collections module also includes a specialized dict subclass called Counter, which is designed specifically for counting items. Using Counter makes frequency counting exceptionally clean and concise:

from collections import Counter

elements = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counts = Counter(elements)

print(counts)  # Output: Counter({'apple': 3, 'banana': 2, 'orange': 1})

Because Counter is tailored for counting, it goes beyond simple tallying to provide advanced utility methods. For instance, the most_common() method allows you to quickly retrieve the elements with the highest frequencies:

from collections import Counter

counts = Counter(['apple', 'banana', 'apple', 'orange', 'banana', 'apple'])

# The 2 most common elements:
print(counts.most_common(2))  # Output: [('apple', 3), ('banana', 2)]
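Two other Counter behaviors are often useful alongside most_common(): reading a missing element returns 0 without adding it to the counter, and update() adds to the existing counts rather than replacing them. A small sketch with made-up sample data:

```python
from collections import Counter

counts = Counter(['apple', 'banana', 'apple'])

# A missing element reads as 0 and is not added to the counter.
print(counts['cherry'])    # 0
print('cherry' in counts)  # False

# update() adds to existing counts instead of replacing them.
counts.update(['banana', 'cherry'])
print(counts.most_common())  # [('apple', 2), ('banana', 2), ('cherry', 1)]
```

This makes a Counter convenient for tallying data that arrives in batches.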

The Counter class also overloads addition (+), subtraction (-), intersection (&), and union (|), so counters can be combined directly. The results keep only elements whose count is positive:

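from collections import Counter

c1 = Counter(a=3, b=1)
c2 = Counter(a=1, b=2)

# Addition: add the counts
print(c1 + c2)  # Output: Counter({'a': 4, 'b': 3})

# Subtraction: subtract the counts, dropping results that are not positive
print(c1 - c2)  # Output: Counter({'a': 2})

# Intersection: minimum of the counts
print(c1 & c2)  # Output: Counter({'a': 1, 'b': 1})

# Union: maximum of the counts
print(c1 | c2)  # Output: Counter({'a': 3, 'b': 2})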
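The subtraction result above has no entry for 'b' because the - operator discards counts that are zero or negative. If the negative counts matter, the subtract() method may be a better fit: it modifies the counter in place and keeps them. A brief sketch (the variable names diff and c3 are just for illustration):

```python
from collections import Counter

c1 = Counter(a=3, b=1)
c2 = Counter(a=1, b=2)

# The - operator returns a new counter and drops non-positive counts.
diff = c1 - c2
print(diff)  # Counter({'a': 2})

# subtract() modifies the counter in place and keeps zero and negative counts.
c3 = Counter(c1)  # work on a copy so c1 is left untouched
c3.subtract(c2)
print(c3)  # Counter({'a': 2, 'b': -1})
```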

pandas Library

Data analysis in Python relies heavily on the pandas library. pandas is a powerful open-source library that provides high-performance, intuitive data structures and analysis tools. Built on top of NumPy, it integrates seamlessly with scientific and visualization libraries like SciPy and Matplotlib. For a detailed guide, refer to the Data Analysis and Pandas chapter.

If your project already uses pandas, or if you are working with large datasets, the library offers highly optimized methods for counting frequencies:

value_counts()

A Series is a one-dimensional array-like object in pandas. Its value_counts() method counts the occurrences of unique values and returns the results sorted in descending order by default. To count a list of items, we can convert it into a Series and then call value_counts():

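import pandas as pd

# Create a Series object
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple'])

# Count with value_counts()
counts = s.value_counts()
print(counts)

# Output:
# apple     3
# banana    2
# orange    1
# dtype: int64
# (pandas 2.x prints "Name: count, dtype: int64" as the last line)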

groupby()

The primary data structure in pandas is the DataFrame, a two-dimensional, tabular structure with labeled rows and columns, similar to a spreadsheet or SQL table. If your data is stored in a DataFrame, you can group it using the groupby() method and call .size() on the resulting groups to count the number of records in each category. Unlike value_counts(), the result is ordered by group key by default, not by count:

import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple'],
    'Quantity': [5, 3, 6, 2, 7, 8]
})

# Count rows per value of the 'Fruit' column with groupby()
counts = df.groupby('Fruit').size()
print(counts)

# Output (groups are sorted by key):
# Fruit
# apple     3
# banana    2
# orange    1
# dtype: int64
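Because groupby() sorts by group key by default, its ordering only matches a frequency ranking by coincidence. The sketch below uses hypothetical sample data in which the two orderings differ, and shows two ways to get a count-ordered result: sorting the groupby() result, or calling value_counts() on the column.

```python
import pandas as pd

# Hypothetical data: 'banana' is the most frequent but not first alphabetically.
df = pd.DataFrame({
    'Fruit': ['banana', 'apple', 'banana', 'orange', 'banana', 'apple'],
    'Quantity': [3, 5, 7, 2, 4, 6],
})

# groupby().size() orders the result by group key.
by_key = df.groupby('Fruit').size()
print(list(by_key.index))  # ['apple', 'banana', 'orange']

# Sort explicitly to rank by frequency.
by_count = by_key.sort_values(ascending=False)
print(list(by_count.index))  # ['banana', 'apple', 'orange']

# value_counts() on the column gives the same ranking directly.
ranked = df['Fruit'].value_counts()
print(list(ranked.index))  # ['banana', 'apple', 'orange']
```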

Counting with Arrays

If the elements to be counted are non-negative integers within a known range from 0 to n-1 (or can be mapped to such integers), we can use array indexing instead of dictionary-based methods.

By allocating an integer array of size n initialized to zeros, we can iterate through the dataset and use each value i as an index, incrementing the corresponding slot by 1. A single pass over the data completes the tally. When n is small relative to the amount of data, this array is simpler and typically more memory-efficient than a dictionary.
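The idea can be sketched in plain Python with a list standing in for the array. The function name count_with_array is hypothetical, and the range check is an added safeguard rather than part of the basic algorithm.

```python
def count_with_array(values, n):
    """Count occurrences of integers in the range 0..n-1 using a list."""
    counts = [0] * n  # one slot per possible value, all starting at zero
    for value in values:
        if not 0 <= value < n:
            raise ValueError(f"value {value} is outside the range 0..{n - 1}")
        counts[value] += 1  # the value itself is the index
    return counts


data = [0, 1, 1, 3, 2, 1, 7]
print(count_with_array(data, 8))  # [1, 3, 1, 1, 0, 0, 0, 1]
```

Without the range check, a negative value would silently index from the end of the list, so validating the input is worth the small cost.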

The NumPy library provides a bincount() function that implements this logic. For example:

import numpy as np

x = np.array([0, 1, 1, 3, 2, 1, 7])
count = np.bincount(x)

print(count)  # Output: [1 3 1 1 0 0 0 1]
# That is, 0 appears once, 1 appears 3 times, and so on.

Exercises

Character Count

Write a program to count the number of occurrences of each character in a string.

from collections import Counter

input_string = "pneumonoultramicroscopicsilicovolcanoconiosis"
character_count = Counter(input_string)

for char, count in character_count.items():
    print(f"Character '{char}' appears {count} time(s)")