Detecting duplicate keys in a YAML file with Ruby
Rails stores internationalization strings in a large YAML file, and we often shadowed entries by using the same key twice, without noticing that another part of the application had lost its translations. For example:
en:
  feature:
    name: The name
  feature: The feature
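To see why this goes unnoticed, here is a minimal sketch of what happens when such a file is loaded. It assumes the standard Psych loader, which keeps the last value for a repeated key and raises nothing; the variable names are only illustrative.

```ruby
require "psych"

yaml = <<~YAML
  en:
    feature:
      name: The name
    feature: The feature
YAML

# Loading succeeds without any warning: the second "feature" wins.
translations = Psych.safe_load(yaml)

puts translations.dig("en", "feature")          # the plain string survives
puts translations["en"]["feature"].is_a?(Hash)  # the nested mapping is gone
```

Any lookup of `en.feature.name` now fails, although the file still parses cleanly.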
This kind of mistake could be caught by a YAML linter, but there do not seem to be any very good tools for it.
The approach I took instead is a test that reads the YAML AST using Psych and checks that there are no duplicate keys.
The object returned by Psych.parse is a Document that one can navigate like a tree, whose nodes are either scalars (nothing to check), sequences (no keys of their own, but their items must be visited), aliases (not used in my files) or mappings (the ones to check).
Mappings have an even number of children that alternate between keys and values: counting from zero, the even-indexed children are keys and the odd-indexed ones are the corresponding values.
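The key/value layout can be inspected directly. The following is a small sketch, assuming a Psych version that exposes `start_line` on nodes; the sample YAML and variable names are made up for illustration.

```ruby
require "psych"

yaml = <<~YAML
  en:
    feature:
      name: The name
    feature: The feature
YAML

document = Psych.parse(yaml)
root = document.children.first       # top-level mapping with the single key "en"
_en_key, en_mapping = root.children  # children alternate: key node, then value node

# Pair up keys and values, two children at a time.
pairs = en_mapping.children.each_slice(2).map do |key, value|
  [key.value, value.class.name.split("::").last, key.start_line + 1]
end

pairs.each { |name, kind, line| puts "#{name} -> #{kind} (line #{line})" }
```

Both entries are still present in the AST, which is what makes the duplicate detectable before it is collapsed into a Hash.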
The logic is thus to walk the tree recursively. Since this runs as a test, I made sure to return all duplicates at once (I don’t want to re-run it to discover yet another violation), along with the path and line number of each violation (which makes it easy to find and fix).
require "psych"
require "set"

# Sequences have no keys of their own, but their items may contain mappings.
def find_sequence_duplicates(sequence, prefix)
  sequence.children.each_with_index.flat_map do |child, index|
    find_node_duplicates(child, "#{prefix}[#{index}]")
  end
end

# Children of a mapping alternate between keys and values.
def find_mapping_duplicates(mapping, prefix)
  _all_keys, all_duplicates = mapping.children.each_slice(2).reduce([Set[], []]) do |(keys, duplicates), (key, value)|
    child_prefix = "#{prefix}/#{key.value}"
    # start_line is zero-based, hence the + 1
    current_duplicates = keys.include?(key.value) ? ["#{child_prefix} (line #{key.start_line + 1})"] : []
    current_keys = keys + [key.value]
    child_duplicates = find_node_duplicates(value, child_prefix)
    [current_keys, duplicates + current_duplicates + child_duplicates]
  end
  all_duplicates
end

# A document wraps a single root node.
def find_doc_duplicates(doc, prefix)
  find_node_duplicates(doc.children[0], prefix)
end

# Dispatch on the node type; aliases are deliberately not handled.
def find_node_duplicates(node, prefix)
  if node.document?
    find_doc_duplicates(node, prefix)
  elsif node.sequence?
    find_sequence_duplicates(node, prefix)
  elsif node.mapping?
    find_mapping_duplicates(node, prefix)
  elsif node.scalar?
    []
  else
    raise "Unhandled node at #{prefix}: #{node}"
  end
end

def find_file_duplicates(file)
  find_doc_duplicates(Psych.parse(File.read(file)), "")
end
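To illustrate the shape of the result, here is a condensed, self-contained variant of the same walk run on an in-memory string. It is only a sketch: `duplicate_keys` is a hypothetical name, it assumes every mapping key is a scalar, and unlike the code above it skips aliases instead of raising.

```ruby
require "psych"

# Hypothetical condensed variant of the tree walk; returns "path (line N)" strings.
def duplicate_keys(node, prefix = "")
  case node
  when Psych::Nodes::Document
    node.children.flat_map { |child| duplicate_keys(child, prefix) }
  when Psych::Nodes::Sequence
    node.children.each_with_index.flat_map { |child, i| duplicate_keys(child, "#{prefix}[#{i}]") }
  when Psych::Nodes::Mapping
    seen = {}
    node.children.each_slice(2).flat_map do |key, value|
      path = "#{prefix}/#{key.value}"
      # start_line is zero-based, hence the + 1
      found = seen.key?(key.value) ? ["#{path} (line #{key.start_line + 1})"] : []
      seen[key.value] = true
      found + duplicate_keys(value, path)
    end
  else
    [] # assumption: scalars and aliases have nothing to report
  end
end

yaml = <<~YAML
  en:
    feature:
      name: The name
    items:
      - id: 1
        id: 2
    feature: The feature
YAML

duplicates = duplicate_keys(Psych.parse(yaml))
puts duplicates
# /en/items[0]/id (line 6)
# /en/feature (line 7)
```

Each entry carries both the path and the line of the second occurrence, which is the one that shadows the earlier entry.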
The spec then only has to list the translation files and check each one, reporting all violations at once:
Dir.glob("config/locales/**/*.yml").each do |translation_file|
  describe translation_file do
    it "should have no duplicate" do
      duplicates = find_file_duplicates(translation_file)
      expect(duplicates).to be_empty
    end
  end
end
There could be other tests as well, for example to check that the same keys are defined in every language.
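As a rough sketch of such a check, the keys of each locale can be flattened into dotted paths and compared. The helper names `flatten_keys` and `missing_keys` are hypothetical, and the sample data is made up. A real test would probably also need to tolerate pluralization keys, which legitimately differ between languages.

```ruby
require "yaml"

# Flatten a nested hash into dotted key paths, e.g. "feature.name".
def flatten_keys(hash, prefix = nil)
  hash.flat_map do |key, value|
    path = [prefix, key].compact.join(".")
    value.is_a?(Hash) ? flatten_keys(value, path) : [path]
  end
end

# For each locale, list the keys that exist in some other locale but not in this one.
def missing_keys(locales)
  keys_by_locale = locales.transform_values { |tree| flatten_keys(tree) }
  all_keys = keys_by_locale.values.flatten.uniq
  keys_by_locale
    .transform_values { |keys| all_keys - keys }
    .reject { |_locale, missing| missing.empty? }
end

en = YAML.safe_load("en:\n  feature:\n    name: The name\n  title: Hello\n")
fr = YAML.safe_load("fr:\n  feature:\n    name: Le nom\n")

locales = { "en" => en["en"], "fr" => fr["fr"] }
missing_keys(locales).each do |locale, keys|
  puts "#{locale} is missing: #{keys.join(', ')}"
end
```

In a spec, the expectation would simply be that `missing_keys` returns an empty hash.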