muscima.grammar module¶
This module implements a Grammar.
A Grammar is a set of rules about how objects from a certain set of classes are allowed to form relationships. In a dependency grammar, the relationships are formed directly between the objects. (In constituency grammars, we’d have a “merge result” object instead.)
In the muscima
package, you can use grammars to validate
whether the relationships between the annotated objects conform
to the specification.
Warning
The grammar is not a formal specification. Music notation sometimes breaks its own rules. More importantly, people who write music notation by hand make mistakes. This means that not all annotation files will pass grammar validation without errors, and that is fine. If this bothers you, use the MUSCIMarker tool to visualize the errors.
Todo
create image:: ../doc/_static/grammar_explainer.png
-
class
muscima.grammar.
DependencyGrammar
(grammar_filename, alphabet)[source]¶ Bases:
object
The DependencyGrammar class implements rules about valid graphs above objects from a set of recognized classes.
The Grammar complements a Parser. It defines rules, and the Parser implements algorithms to apply these rules to some input.
A grammar has an Alphabet and Rules. The alphabet is a list of symbols that the grammar recognizes. Rules are constraints on the structures that can be induced among these symbols.
There are two kinds of grammars according to what kinds of rules they use: dependency rules, and constituency rules. We use dependency grammars. Dependency grammar rules specify which symbols are governing, and which symbols are governed:
notehead_full | stem
There can be multiple left-hand side and right-hand side symbols, as a shortcut for a list of rules:
notehead_full | stem beam notehead_full notehead_empty | ledger_line duration-dot tie grace_note
The asterisk works as a wildcard. Currently, only one wildcard per symbol is allowed:
time_signature | numeral_*
Lines starting with a
#
are regarded as comments and ignored. Empty lines are also ignored.Cardinality rules
We can also specify in the grammar the minimum and/or maximum number of relationships, both inlinks and outlinks, that an object can form with other objects of given types. For example:
- One notehead may have up to two stems attached.
- We also allow for stemless full noteheads.
- One stem can be attached to multiple noteheads, but at least one.
This would be expressed as:
notehead-*{,2} | stem{1,}
The relationship of noteheads to ledger lines is generally
m:n
:notehead-full | ledger_line
A time signature may consist of multiple numerals, but only one other symbol:
time_signature{1,} | numeral_*{1} time_signature{1} | whole-time_mark alla_breve other_time_signature
A key signature may have any number of sharps and flats. A sharp or flat can only belong to one key signature. However, not every sharp belongs to a key signature:
key_signature | sharp{,1} flat{,1} natural{,1} double_sharp{,1} double_flat{,1}
For the left-hand side of the rule, the cardinality restrictions apply to outlinks towards symbols of classes on the right-hand side of the rule. For the right-hand side, the cardinality restrictions apply to inlinks from symbols of left-hand side classes.
It is also possible to specify that regardless of where outlinks lead, a symbol should always have at least some:
time_signature{1,} | repeat{2,} |
And analogously for inlinks:
| letter_*{1,} | numeral_*{1,} | ledger_line{1,} | grace-notehead-*{1,}
Interface
The basic role of the dependency grammar is to provide the list of rules:
>>> from muscima.io import parse_cropobject_class_list >>> fpath = os.path.dirname(os.path.dirname(__file__)) + u'/test/test_data/mff-muscima-classes-annot.deprules' >>> mlpath = os.path.dirname(os.path.dirname(__file__)) + u'/test/test_data/mff-muscima-classes-annot.xml' >>> mlclass_dict = {m.name for m in parse_cropobject_class_list(mlpath)} >>> g = DependencyGrammar(grammar_filename=fpath, alphabet=mlclass_dict) >>> len(g.rules) 578
The grammar can validate against these rules. The workhorse of this functionality is the
find_invalid_in_graph()
method, which finds objects that have inlinks/outlinks which do not comply with the grammar, and the non-compliant inlinks/outlinks as well.If we have the following notation objects
0
,1
,2
, and3
, with the following symbol classes:>>> vertices = {0: 'notehead-full', 1: 'stem', 2: '8th_flag', 3: 'notehead_empty'}
And the following relationships were recorded:
>>> edges = [(0, 1), (0, 2), (0, 3)]
We can check for errors against our music notation symbols dependency grammar:
>>> wrong_vertices, wrong_inlinks, wrong_outlinks = g.find_invalid_in_graph(vertices=vertices, edges=edges)
Because the edge
(0, 3)
connects a full notehead to an empty notehead, the method should report the objects0
and3
as wrong, as well as the corresponding inlink of3
and outlink of0
:>>> wrong_vertices [3, 0] >>> wrong_inlinks [(0, 3)] >>> wrong_outlinks [(0, 3)]
(Note that both the inlinks and outlinks are recorded in a
(from, to)
format.)Caution
Aside from checking against illegal relationships (such as we saw in the example), errors can also come from too many or too few inlinks/outlinks of a given type. However, the validation currently implements checks only for aggregate cardinalities, not for pair cardinalities (so, there can be e.g. multiple sharps attached to a notehead, even though the cardinality in the
notehead | sharp
rule is set to max. 1).Grammar file formats
The alphabet is stored by means of a
CropObjectClassList
XML file withCropObjectClass
elements, as described in themuscima.io
module.The rules are stored in rule files, with the suffix
.deprules
.A rule file line can be empty, start with a
#
(comment), or contain a rule symbol|
. Empty lines and comments are ignored during parsing. Rules are split into left- and right-hand side tokens, according to the position of the|
symbol.Parsing a token returns the token string (unexpanded wildcards), its minimum and maximum cardinality in the rule (defaults are
(0, 10000)
if no cardinality is provided).>>> g.parse_token('notehead-*') ('notehead-*', 0, 10000) >>> g.parse_token('notehead-*{1,5}') ('notehead-*', 1, 5) >>> g.parse_token('notehead-*{1,}') ('notehead-*', 1, 10000) >>> g.parse_token('notehead-*{,5}') ('notehead-*', 0, 5) >>> g.parse_token('notehead-*{1}') ('notehead-*', 1, 1)
The wildcards are expanded at the level of a line.
>>> l = 'notehead-*{,2} | stem' >>> rules, inlink_cards, outlink_cards, _, _ = g.parse_dependency_grammar_line(l) >>> rules [('notehead-empty', 'stem'), ('notehead-full', 'stem')] >>> outlink_cards['notehead-empty'] == {'stem': (0, 2)} True >>> inlink_cards['stem'] == {'notehead-empty': (0, 10000), 'notehead-full': (0, 10000)} True
A key signature can have any number of sharps, flats, or naturals, but if a given symbol is part of a key signature, it can only be part of one.
>>> l = 'key-signature | sharp{1} flat{1} natural{1}' >>> rules, inlink_cards, _, _, _ = g.parse_dependency_grammar_line(l) >>> rules [('key-signature', 'flat'), ('key-signature', 'natural'), ('key-signature', 'sharp')] >>> inlink_cards == {'natural': {'key-signature': (1, 1)}, ... 'sharp': {'key-signature': (1, 1)}, ... 'flat': {'key-signature': (1, 1)}} True
You can also give aggregate cardinality rules, of the style “whatever rule applies, there should be at least X/at most Y edges for this type of object”. (If no maximum is specified, the value of
DependencyGrammar._MAX_CARD
is used, which is by default 10000).>>> l = 'key-signature{1,} |' >>> _, _, _, _, out_aggregate_cards = g.parse_dependency_grammar_line(l) >>> out_aggregate_cards == {'key-signature': (1, 10000)} True >>> l = 'grace-notehead*{1,} |' >>> _, _, _, _, out_aggregate_cards = g.parse_dependency_grammar_line(l) >>> out_aggregate_cards == {'grace-notehead-empty': (1, 10000), 'grace-notehead-full': (1, 10000)} True >>> l = '| beam{1,} stem{1,} flat{1,}' >>> _, _, _, in_aggregate_cards, _ = g.parse_dependency_grammar_line(l) >>> in_aggregate_cards == {'stem': (1, 10000), 'beam': (1, 10000), 'flat': (1, 10000)} True
-
WILDCARD
= '*'¶
-
find_invalid_in_graph
(vertices, edges, provide_reasons=False)[source]¶ Finds vertices and edges where the given object graph does not comply with the grammar.
Wrong vertices are any that:
- are not in the alphabet;
- have a wrong inlink or outlink;
- have missing outlinks or inlinks.
Discovering missing edges is difficult, because the grammar defines cardinalities on a per-rule basis and there is currently no way to make a rule compulsory, or to require at least one rule from a group to apply. It is currently not implemented.
Wrong outlinks are such that:
- connect symbol pairs that should not be connected based on their classes;
- connect so that they exceed the allowed number of outlinks to the given symbol type
Wrong inlinks are such that:
- connect symbol pairs that should not be connected based on their classes;
- connect so that they exceed the allowed number of inlinks to the given symbol based on the originating symbols’ classes.
Parameters: - vertices – A dict with any keys, and values corresponding to the alphabet of the grammar.
- edges – A list of
(from, to)
pairs, where bothfrom
andto
are valid keys into thevertices
dict. - provide_reasons – If set, will generate string descriptions of each error and return them.
Returns: A list of vertices, a list of inlinks and a list of outlinks that do not comply with the grammar. If
provide_reasons
is set, also returns three more: dicts of written reasons for each error (vertex, inlink, outlink).
-
inlink_aggregated_cardinalities
= None¶ Keys: classes, values: (min, max)
-
inlink_cardinalities
= None¶ Keys: classes, values: dict of {from: (min, max)}
-
outlink_aggregated_cardinalities
= None¶ Keys: classes, values: (min, max)
-
outlink_cardinalities
= None¶ Keys: classes, values: dict of {to: (min, max)}
-
parse_dependency_grammar_line
(line)[source]¶ Parse one dependency grammar line. See DependencyGrammar I/O documentation for the full format description of valid grammar lines.
The grammar line specifies two kinds of information: which symbol classes may form relationships, and what the valid cardinalities for these relationships are. For instance, while
time_signature
symbols have outlinks tonumeral_X
symbols, one numeral cannot be part of more than one time signature.A grammar line has a left-hand side (lhs) and a right-hand side (rhs), separated by the
|
symbol.(See
DependencyGramamr
documentation for examples.)Parameters: line (str) – One line of a dependency grammar rule file. Returns: A quintuplet of: rules
: a list of(from_class, to_class)
tuples. Each rule tuple encodes that relationships leading from symbols of typefrom_class
to symbols of typeto_class
may exist.inlink_cards
: a dictionary that encodes the range of permitted cardinalities for each RHS symbol of inlinks from the LHS symbols.outlink_cards
: a dictionary that encodes the range of permitted cardinalities for each LHS of outlinks to the RHS symbols.inlink_aggregate_cards
: A dict that holds for each RHS the range of permitted total inlink counts. E.g., a stem must always have at least one inlink.outlink_aggregate_cards
: A dict that holds for each LHS the range of permitted total outlink counts. E.g., a full notehead must always have at least one outlink.
For non-grammar lines (see
parse_dependency_grammar_rules()
), this method returns empty data structures.
-
parse_dependency_grammar_rules
(filename)[source]¶ Returns the rules stored in the given rule file.
A dependency grammar rule file contains grammar lines, comment lines, and other lines. A grammar line is any line that contains the
|
symbol and does not have a#
as the first non-whitespace symbol.Comment lines are those that have
#
as the first non-whitespace symbol. They are ignored, even if they contain|
. All other lines that do contain|
are considered to be grammar lines.All lines that do not contain
|
are considered non-grammar lines and are ignored.Parameters: filename – The path to the rule file. Returns: A quintuplet of the grammar rules. rules
: a list of(from_class, to_class)
tuples. Each rule tuple encodes that relationships leading from symbols of typefrom_class
to symbols of typeto_class
may exist.inlink_cards
: a dictionary that encodes the range of permitted cardinalities for each RHS symbol of inlinks from the LHS symbols.outlink_cards
: a dictionary that encodes the range of permitted cardinalities for each LHS of outlinks to the RHS symbols.inlink_aggregate_cards
: A dict that holds for each RHS the range of permitted total inlink counts. E.g., a stem must always have at least one inlink.outlink_aggregate_cards
: A dict that holds for each LHS the range of permitted total outlink counts. E.g., a full notehead must always have at least one outlink.
-
parse_token
(l)[source]¶ Parse one
*.deprules
file token. See class documentation for examples.Parameters: l – One token of a *.deprules
file.Returns: token, cmin, cmax
-
validate_edge
(head_name, child_name)[source]¶ Check whether a given
head --> child
edge conforms with this grammar.
-
validate_graph
(vertices, edges)[source]¶ Checks whether the given graph complies with the grammar.
Parameters: - vertices – A dict with any keys and values corresponding to the alphabet of the grammar.
- edges – A list of
(from, to)
pairs, where bothfrom
andto
are valid keys into thevertices
dict.
Returns: True
if the graph is valid,False
otherwise.