Efficient concept-formation

Efficient Concept-Formation for Languages of the Future

By Wolfgang G. Gasser

Number-ordered schematic concept-formation

Schematic concept-formation (was: Is Lojban ...)

Occam's razor and efficient concept formation in planlinguistics

Number-ordered schematic concept-formation – 2005-01-08

Bob LeChevalier wrote:

Russian does not have a word for "hand" that is distinct from "arm" – the hand is merely the end of the arm, and your arm has fingers. Does your conlang [constructed language] mimic Russian, English, or divide the semantic space in still a different way in defining the word for "arm"? (I don't care about whether you have a word for hand - the question is whether you can say that your fingers are attached to your arm as the Russians do. If not, then your language is harder for Russians to learn than for English speakers.)

Let us start with a concept for the whole arm (including the hand). We have several possibilities to subdivide this concept. The two most obvious subdivisions seem to me to be:

1. arm (without hand)   2. hand
1. upper arm   2. forearm   3. hand

In constructed (maybe even in natural) languages one should introduce morphemes for subdivisions into given numbers of parts, e.g.

into 2 parts:    2p1   2p2
into 3 parts:    3p1   3p2    3p3
into 4 parts:    4p1   4p2   4p3    4p4

and an open, ordered series

1 2 3 4 ...

for simple labeling of e.g. fingers, months, or days of the week.

The left number of 4p3 indicates the number of parts in the subdivision, p means part, and the right number indicates the ordinal number of the designated part.

If we introduce general rules as premises, such as

1) from general to specific
2) from center (origin) to border
3) from past to future
...

then we can create the following concepts:

arm-2p1       arm without hand (near the origin)
arm-2p2       hand (away from the origin)
arm-2p1-2p1   upper arm
arm-2p1-2p2   forearm
arm-2p2-2p1   hand without fingers
arm-2p2-2p2   finger(s)
arm-2p2-2p2-1 thumb
arm-2p2-2p2-5 little finger

If context makes it clear then 2p2-5 instead of arm-2p2-2p2-5 can be enough to designate the concept 'little finger'.

Also the transitions between the subdivided parts (i.e. begin or end of a part if one starts from the origin) should be taken into consideration:

2t0   2t1   2t2
3t0   3t1   3t2   3t3
4t0   4t1   4t2   4t3   4t4
...

We get (approximately):

arm-2t0        shoulder joint (begin of arm)
arm-2t1        wrist (begin of hand)
arm-2t2        fingertips (end of hand)
arm-2p1-2t1    junction upper-arm forearm
arm-2p2-2t1-2 begin of index finger
arm-2p2-2t2-2 tip of index finger
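
The part and transition labels follow a fixed pattern, so they can be generated mechanically. A minimal Python sketch; the helper names `parts`, `transitions` and `compound` are hypothetical and not part of the proposal itself:

```python
def parts(n):
    """Labels for the n parts of an n-fold subdivision: np1 .. npn."""
    return [f"{n}p{k}" for k in range(1, n + 1)]


def transitions(n):
    """Labels for the n + 1 transitions: nt0 (begin) .. ntn (end)."""
    return [f"{n}t{k}" for k in range(n + 1)]


def compound(root, *steps):
    """Join a root concept and successive subdivision steps with hyphens."""
    return "-".join([root, *map(str, steps)])


print(parts(3))                          # ['3p1', '3p2', '3p3']
print(transitions(2))                    # ['2t0', '2t1', '2t2']
print(compound("arm", "2p2", "2p2", 5))  # arm-2p2-2p2-5 (little finger)
```

An n-fold subdivision always yields n part labels and n + 1 transition labels, which is why the 2-series has three transitions and the 4-series has five.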

We can also subdivide the animal/human body in such a way:

body-2p1 trunk
body-2p2 limb(s)
body-2p2-1 head
body-2p2-2 hands, forelegs, wings
body-2p2-3 legs
body-2p2-4 tail

body-2t1    transition region(s) from trunk to limb(s)
body-2t1-1  neck
body-2t1-2  shoulder
body-2t1-3  transition from trunk to legs
body-2t1-4  trans. from trunk to tail (tail rudiment)

There are certainly other subdivisions of the human body which make sense (e.g. body-2p2-1 as neck + head with further sub-divisions 2p1 as neck and 2p2 as head).

The question of which subdivisions make the most sense can be dealt with in a rather objective way. Final decisions should be made as transparently and democratically as possible.

Here is a further example:

body 2p2                 limb(s)
body 2p2-2               arm (inclusive hand)
body 2p2-2-2p2           hand
body 2p2-2-2p2 2p2       finger(s)
body 2p2-2-2p2 2p2-5     little finger
body 2p2-2-2p2 2p2-5-3p2 mid part of little finger

Also the meanings of

body 2p2-2-2p2 2p2-5-3t1
body 2p2-2-2p2 2p2-5-3t2

should be clear.

Dana Nutter wrote:

One thing I have not done yet is define the parts of the day. I have a word for "day", but still no words for "morning", "evening", "night", "afternoon", etc. I'm thinking of maybe something like "dark time" and "light time" for a distinction between daylight and night, and maybe something like "mid day" for noon. I want to keep it simple, but also don't want a bunch of long multisyllabic compounds for such basic concepts.

If "day" means 24 hours from midnight to midnight, then we get the following useful compounds in a completely natural way:

day_4p1 = from midnight to morning
day_4p2 = morning
day_4p3 = afternoon
day_4p4 = evening (until midnight)

day_4t0 = midnight
day_4t1 = morning (transition from night to day)
day_4t2 = noon
day_4t3 = evening (transition from day to night)
day_4t4 = midnight
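
As a rough illustration, a clock time can be mapped onto such a part label. The Python sketch below assumes, purely for simplicity, that the n parts are equally long; the scheme itself does not require this. The function name `day_part` is hypothetical:

```python
def day_part(hour, n=4):
    """Return the label of the part of the day that contains `hour`.

    `hour` is a clock time with 0 <= hour < 24, counted from midnight.
    Assumption for illustration only: the n parts are equally long.
    """
    if not 0 <= hour < 24:
        raise ValueError("hour must be in [0, 24)")
    k = int(hour * n // 24) + 1
    return f"day_{n}p{k}"


print(day_part(1))    # day_4p1
print(day_part(9))    # day_4p2
print(day_part(15))   # day_4p3
print(day_part(22))   # day_4p4
```

With unequal parts, the same function would only need a table of boundary times instead of the division by 24.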

Number-ordered schematic concept-formation – 2005-01-09

Bob LeChevalier:

Too vague.

I consider this vagueness an advantage, because it is a prerequisite of 'easy translation of such concepts' from other languages. Any AUXLANG should make it possible to translate widespread concepts such as 'morning' or 'evening' independently of overly concrete concepts and data concerning working time, sleeping customs, season and latitude.

The introduction of 'number-ordered schematic concept-formation' in a given language does not prevent the formation of further concepts (compounds) which are less vague (e.g. concepts for the time we get up, we come from work, we eat, the sun appears or disappears, and so on).

My concrete application of the 4-subdivision-series to a concept which approximately can be designated by '(calendar) day' presupposes that 'day' can reasonably be subdivided into four parts (not necessarily equally long, but directly concatenated with each other). So for the eight resulting compounds it doesn't matter whether the sun, working time, sleeping customs or something else is considered as the principal cause of subdivision.

And you are missing at least one. The transition from night to day usually ends by around 9 AM. Noon is 3 hours later. Thus you need a word for forenoon.

The transition from the first to the second fourth (day_4t1) is 'morning as night-day-transition' and the second fourth itself (day_4p2) is 'morning as forenoon'.

But this is still vague because it is not clearly defined when morning ends and forenoon begins. It would seem to depend on the date and the latitude and the position within the time zone, which is beyond human beings to track exactly.

Thus different cultures define their words culturally. English "morning" sometimes can refer to 1 AM in the dead of night but changes situationally (Mom to crying kid "It's one in the morning; go back to sleep." and then five minutes later she can say "You don't need another glass of water. It can wait until morning", even though by her first statement it already IS morning.).

In "one in the morning", 'morning' can authentically be translated by day_4p1, i.e. the first part of a day. If it were "one in the middle of the night", day_4t0 could be used. (In any case, there are other possibilities to discriminate 1:00 AM from 1:00 PM.)

In "it can wait until morning", the obvious translation is day_4t1, i.e. 'morning' in the sense of transition from first fourth to the second, from night to day, from bed to work, or similar.

English morning goes all the way until 11:59 AM, but I've been told that in some languages (German?) there is a different word for forenoon, and the word for "morning" would never be used after the time when workers have arrived at the office.

I personally use 'am Morgen' for both day_4t1 and day_4p2. If I want to discriminate, I use 'in der Früh' (t1) and 'am Vormittag' (p2).

If there actually is a concept of 'morning' in English which starts just after midnight and ends just before midday, then we should translate this concept by day_2p1, i.e. the first half of the (calendar) day.

Defining words can be pretty tricky when you see subdivisions that others do not see, or miss the subdivisions that they do see.

That's very right. However, an efficient, culturally neutral auxiliary language must provide efficient mechanisms in order to allow also subdivisions relevant only to special groups. Such mechanisms must be on the one hand as general and on the other hand as clearly defined/ordered as possible.

Somebody already knowing such (reasonably implemented) 4-subdivision-morphemes (4t0, 4p1, 4t1, 4p2, 4t2, 4p3, 4t3, 4p4, 4t4) can learn eight or nine useful words concerning daytime almost without additional expenditure.

(Not posted) Number-ordered schematic concept-formation – 2005-01-15

Wolfgang G.:

Let us start with a concept for the whole arm (including the hand). We have several possibilities to subdivide this concept. The two most obvious subdivisions seem to me to be:

1. arm (without hand)   2. hand
1. upper arm   2. forearm   3. hand

Dana Nutter:

Both of these are workable. I think another way to describe it would be to use further subdivisions:

1. arm
a. upper arm
b. forearm
2. hand
3. finger(s)

My interpretation of that subdivision leads to a concept hand which doesn't include the fingers. Therefore I prefer this subdivision:

1. arm (without hand)
a. upper arm
b. forearm
2. hand (with fingers)
a. hand without fingers
b. fingers

If an auxiliary language provides logically and efficiently constructed, easily learnable concepts for bones, muscles and so on, then it could even happen that these concepts will be used in natural languages as foreign words, or reproduced by using own words and morphemes, especially if corresponding concept dictionaries are available.

Times & Dates – 2005-02-09

Dana Nutter:

I'm wondering if someone can point me in the right direction here. I'm looking for information on how different languages express times & dates. So far this seems to be one of those areas where languages tend to have a lot of idiomatic usage.

The following deals with "times & dates" not as they are, but as, in my opinion, they could or should be in future languages:

Proposal of an International Calendar

Using the unit 'day' for small time intervals such as minutes and seconds isn't a problem if a language has a number system that expresses the order of magnitude as efficiently as the significant digits. (See: Zahlen in Plansprachen)

In the absence of such an efficient number system, we can use the normal auxiliary coefficient-words such as deci, centi, milli, micro, nano and so on:

deciday    2.4 h        2 h 24 min
centiday   0.24 h       14.4 min     14 min 24 sec
milliday   0.024 h      1.44 min     1 min 26.4 sec
microday   0.000024 h   0.0864 sec

Who knows how many milliseconds make a whole day?
Who knows how many microdays make a whole day?
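
The arithmetic behind the table and the two questions can be checked with a few lines of Python. This is only a sketch; the names `DIVISORS` and `seconds_in` are hypothetical, and exact fractions are used to avoid rounding noise:

```python
from fractions import Fraction

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Divisor belonging to each decimal prefix used in the table.
DIVISORS = {"deci": 10, "centi": 100, "milli": 1_000, "micro": 1_000_000}


def seconds_in(unit):
    """Length of a unit such as 'milliday' in seconds, as an exact fraction."""
    if not unit.endswith("day"):
        raise ValueError(f"not a day-based unit: {unit!r}")
    prefix = unit[: -len("day")]
    return Fraction(SECONDS_PER_DAY, DIVISORS[prefix])


for unit in ("deciday", "centiday", "milliday", "microday"):
    print(unit, float(seconds_in(unit)), "sec")

milliseconds_per_day = SECONDS_PER_DAY * 1_000   # 86400000
microdays_per_day = DIVISORS["micro"]            # 1000000
```

The second question answers itself from the prefix alone, while the first requires multiplying 24 by 60 by 60 by 1000.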

Times & Dates – 2005-02-10

Peter T. Daniels:

Why would you base your "new calendar" on a miscalculation of the year of the birth of Jesus? Why don't you start with the new Maya era, which begins in 2012?

We need a clear and simple relation to the most widespread calendar which is obviously based on "a miscalculation of the year of the birth of Jesus":

3000 -->                 = 1000 ...
2150 -->             150 = 0150 ...
2003 --> 3 = 03 = 003 = 0003 ...
2000 --> 0 = 00 = 000 = 0000 ...
1999 --> n9 = n99 = n999 = n9999 ...
1990 --> n0 = n90 = n990 = n9990 ...
1989 -->      n89 = n989 = n9989 ...
1900 -->      n00 = n900 = n9900 ...
1899 -->            n899 = n9899 ...
1596 -->            n596 = n9596 ...
1000 -->            n000 = n9000 ...
   999 -->                   n8999 ...
     1 -->                   n8001 ...
     0 -->                   n8000 ...
    -1 -->                   n7999 ...

The most probable years of the birth of Jesus are n7993 - n7997 in the new calendar, all of which aren't special numbers such as e.g. zero, one or thousand.

Logan Kearsley wrote:

The system for representing times before the start of the calendar seems needlessly complex. Why not simply have the 'n' (or other symbol) indicate that years should be counted backwards from the base time rather than forward, like the B.C.(E.) prefix does in the current system?

If we are consistent in 'counting backwards' then we should also count days (weeks, months) backwards. And we should not forget that originally there was no year zero, only a first year after the birth and a first (resp. last) year before birth.

The solution with the additional digit n = -1 is more consistent than counting backwards. It may seem more complicated, because we are less accustomed to it. From an a priori point of view, however, it is actually more concise and therefore simpler than the use of two different counting directions.
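
The notation can be made concrete with a small conversion routine. The Python sketch below assumes that input years use the numbering of the table above (which includes a year 0) and that the leading digit n stands for -1 at its position, so that n89 means -100 + 89 = -11. The function names are hypothetical, and only the shortest written form is produced:

```python
EPOCH = 2000  # year written as 0 in the proposed calendar


def to_new_year(year):
    """Shortest form of `year` in the proposed notation, e.g. 1989 -> 'n89'."""
    offset = year - EPOCH
    if offset >= 0:
        return str(offset)
    k = 1
    while 10 ** k < -offset:   # find the smallest k with -10**k <= offset
        k += 1
    return "n" + str(10 ** k + offset).zfill(k)


def from_new_year(text):
    """Inverse conversion; longer forms such as 'n9989' are accepted too."""
    if text.startswith("n"):
        digits = text[1:]
        return EPOCH - 10 ** len(digits) + int(digits)
    return EPOCH + int(text)


print(to_new_year(1999))       # n9
print(to_new_year(1596))       # n596
print(to_new_year(1))          # n8001
print(from_new_year("n9989"))  # 1989
```

Because n only marks the leading digit, all remaining digits still count forwards, which is the consistency argument made above.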

Schematic concept-formation (was: Is Lojban ...) – 2005-05-24

Andrew Nowicki:

For example, adjective "big" can be defined as "bigger than average thing of the same kind." Big mouse therefore means a mouse that is bigger than average size mouse.

This definition is obviously circular (insofar as it concerns 'big' as opposed to e.g. 'beautiful'). Nevertheless it is far from worthless, and the words which fall within this definition constitute an interesting class.

What is needed for the above definition to work is:

1) A concept representing a property/state/activity or similar having an intensity (one-dimensional quantity)

2) An expectation of a normal or average value for that intensity

At least the most frequent concepts of this class normally have two different roots in natural languages, e.g. 'big' and 'small'.

Let us ask the question: is a mathematical point big or small? A correct answer is: a mathematical point is neither big nor small because it has no spatial extension and 'big' and 'small' are assessments of spatial extension (or of something similar).

We recognize that (one meaning of) 'big' is in fact a composite of 'spatial extension' and 'more than average'. The same is valid for words such as e.g.

ugly      = 'subjective liking' + 'less than average'
beautiful = 'subjective liking' + 'more than average'
short     = 'one-dimensional extension' + 'less than average'
long      = 'one-dimensional extension' + 'more than average'

So the most straightforward way to implement this class of words in an auxiliary language seems to me the use of one single semantic unit for the corresponding concept (e.g. 'spatial extension') and three semantic units for 'less than average', 'equal to average' and 'more than average'.

If we define

SE = Spatial Extension
LA = Less than Average
EA = Equal to Average
MA = More than Average

then we get four useful concepts (SE, SE-LA, SE-EA, SE-MA):

person 1 is SE-LA (person 1 is small)

person 2 is SE-EA (person 2 is normal)

person 3 is SE-MA (person 3 is big)

soul isn't SE (a soul has no spatial extension)
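
The composition of a dimension with a relative degree can be sketched in Python. The width of the 'equal to average' band (`tolerance`) and the sample numbers are assumptions made only for this illustration; the text does not say how wide that band should be:

```python
def relative_degree(value, average, tolerance=0.1):
    """Return 'LA', 'EA' or 'MA' for a value compared with the average of its kind.

    `tolerance` is the relative half-width of the 'equal to average' band;
    it is an assumption of this sketch.
    """
    if value < average * (1 - tolerance):
        return "LA"
    if value > average * (1 + tolerance):
        return "MA"
    return "EA"


def describe(dimension, value, average):
    """Combine a dimension such as 'SE' with the relative degree."""
    return f"{dimension}-{relative_degree(value, average)}"


# Hypothetical sizes in arbitrary units, compared with an average of 8.
print(describe("SE", 5, 8))    # SE-LA
print(describe("SE", 8, 8))    # SE-EA
print(describe("SE", 12, 8))   # SE-MA
```

The average is always that of the same kind of thing, so a big mouse and a small elephant are both classified against their own reference value.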

Rex F. May:

Zamenhof chose to have 4 roots for North, South, etc., but he chose to not have a root for left, but to derive it from the root for right.

The choice to derive 'left' from 'right' in Esperanto is fully practical, but from a logical point of view it is problematic. The relation between 'mal-dekstra' (left) and 'dekstra' (right) is not the same as between 'mal-alta' (low) and 'alta' (high), but the same as between 'profunda' (deep) and 'alta' (high).

In my opinion the words for the four cardinal points must be created in relation to the words for 'left', 'right', 'ahead' and 'behind' using schematic concept formation with uniform morphemes.

The (semantic) schematism in this case comes from an analogy to a coordinate system with the coordinates x, y and z. An observer can be placed at the origin of the coordinate system, so that these relations

x- <-> left
x+ <-> right
y- <-> behind
y+ <-> ahead
z- <-> below
z+ <-> above

become obvious. We could create directly a (generally applicable) morpheme for each of these six locations. It seems however more reasonable to deal with the underlying sub-concepts:

x <-> line/direction x
y <-> line/direction y
z <-> line/direction z

- <-> negative side
0 <-> (near the) zero-plane
+ <-> positive side

In this way we create as a by-product concepts corresponding to

x0 <-> neither left nor right
y0 <-> neither behind nor ahead
z0 <-> neither below nor above (same height)

So we need seven morphemes/words:

§ one word indicating that we are dealing with normal locations relative to a given observer point of view

§ three morphemes indicating whether we are dealing with left-right, behind-ahead or below-above

§ three morphemes indicating whether we are dealing with the negative side, the positive side or the zero-plane

With an additional word for the concept 'cardinal point', we can by analogy create the corresponding concepts/words for South, North etc. In order to do so, we must place an observer on the equator looking in the direction of the North Pole. We get:

left-right   <-> x  <-> West-East
left         <-> x- <-> West
right        <-> x+ <-> East
                 x0 <-> same longitude

behind-ahead <-> y  <-> South-North
behind       <-> y- <-> South
ahead        <-> y+ <-> North
                 y0 <-> same latitude

below-above  <-> z  <-> below-above
below        <-> z- <-> above opposite side of earth
above        <-> z+ <-> above
same height  <-> z0 <-> horizontal
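
The way one axis morpheme and one side morpheme combine under two different frames of reference can be sketched in Python. The dictionary and function names are hypothetical; only the x and y axes are given for the cardinal frame, and the zero-plane is glossed generically as 'neither ... nor ...' rather than with the specific wordings of the tables:

```python
# (negative side, positive side) for each axis, per frame of reference.
OBSERVER = {"x": ("left", "right"), "y": ("behind", "ahead"), "z": ("below", "above")}
CARDINAL = {"x": ("West", "East"), "y": ("South", "North")}


def gloss(code, frame=OBSERVER):
    """Gloss a two-character code such as 'x-' or 'y0' in the given frame."""
    axis, side = code[0], code[1]
    negative, positive = frame[axis]
    if side == "-":
        return negative
    if side == "+":
        return positive
    if side == "0":
        return f"neither {negative} nor {positive}"
    raise ValueError(f"unknown side: {side!r}")


print(gloss("x-"))              # left
print(gloss("x-", CARDINAL))    # West
print(gloss("y+", CARDINAL))    # North
print(gloss("z0"))              # neither below nor above
```

Three axis morphemes and three side morphemes give nine combinations per frame, and each further frame costs only one additional word.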

Occam's razor and efficient concept formation in planlinguistics – 2005-08-03

One possible criterion to judge planned languages is the number of basic concepts (roots) and of concept-formation principles used in it. The lower these numbers are in relation to the expressivity of the language, the better.

Here is an example of extreme conciseness of concept formation in the case of family relationships. We start with the basic concept is-parent alias ip, which is a relation of two equal arguments, introduced here with the prepositions a and o:

ip a John o Jane = John is parent of Jane
= Jane is child of John

The sequence order of the three constitutive parts (ip, a John, o Jane) does not matter.

In order to derive the nouns parent, father, mother, child, son, and daughter, apart from the basic concept ip we need the following distinctions:

1) Are we dealing with the argument introduced by a (mother, father) or with the other argument introduced by o (daughter, son)?

2) Is it necessary to indicate the sex or not? If yes, then we must introduce an additional distinction between male (m) and female (f).

One possible solution:

parent:    ipa   ( = ip + a = a-argument of ip)
mother:    ipaf  ( = ip + a + female )
father:    ipam
child:     ipo   ( = ip + o = o-argument of ip)
daughter:  ipof
son:       ipom  ( = ip + o + male )

"The daughter of John" becomes 'a John ipof'. "The father of Jane" becomes 'ipam o Jane'.

The interesting point is that we need no further concepts in order to create such nouns as sister, grandfather, grandchild, aunt, nephew, cousin and so on:

brother:     ipaipom      (parent -> child-male)
sister:      ipaipof
grandparent: ipaipa       (parent -> parent)
grandmother: ipaipaf
grandchild:  ipoipo
aunt:        ipaipaipof   (parent -> parent -> child-female)
nephew:      ipaipoipom   (parent -> child -> child-male)
cousin:      ipaipaipoipo (parent -> parent -> child -> child)

In Danish for instance, there are two words for grandfather: morfar (mother -> father) and farfar (father -> father). The same is valid for grandmother (mormor and farmor). Such words are also part of this system: ipafipam, ipamipam, ipafipaf, ipamipaf.

A half-brother on mother's side is ipafipom and on father's side: ipamipom.

In order to discriminate half-siblings and genuine siblings in general we must use a further distinction: between single (s) and two (t).

a genuine sister: ipatipof (parent-two -> child-female)
a half-sister: ipasipof (parent-single -> child-female)

Apart from general principles and the distinction between male and female and between single and two, the concept ip is enough to enable the creation of a huge number of basic concepts which are often confusing and very tedious to learn in natural languages.
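
Because every step of such a compound starts with ip, the words can be segmented and glossed mechanically. A minimal Python sketch, assuming (as in the examples above) that the optional s/t marker precedes the optional m/f marker; the function name `gloss` is hypothetical:

```python
import re

STEP = re.compile(r"ip([ao])([st]?)([mf]?)")
WORD = re.compile(r"(?:ip[ao][st]?[mf]?)+")

ROLE = {"a": "parent", "o": "child"}
COUNT = {"s": "single", "t": "two"}
SEX = {"m": "male", "f": "female"}


def gloss(word):
    """Spell out a compound such as 'ipaipaipof' step by step."""
    if not WORD.fullmatch(word):
        raise ValueError(f"not a well-formed ip compound: {word!r}")
    steps = []
    for role, count, sex in STEP.findall(word):
        labels = [ROLE[role]]
        if count:
            labels.append(COUNT[count])
        if sex:
            labels.append(SEX[sex])
        steps.append("-".join(labels))
    return " -> ".join(steps)


print(gloss("ipaipom"))      # parent -> child-male
print(gloss("ipaipaipof"))   # parent -> parent -> child-female
print(gloss("ipatipof"))     # parent-two -> child-female
```

The segmentation is unambiguous because the letter i occurs only at the start of a step.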

Occam's razor and efficient concept formation in planlinguistics – 2005-08-04

Wolfgang:

The sequence order of the three constitutive parts (ip, a John, o Jane) does not matter.

Nathan Sanders:

Order doesn't matter in any context? How would one indicate topicalization/focus? Are there no pragmatic issues tied to word order at all? (I've noticed conlangers tend to gloss over – or more frequently, completely ignore – pragmatics.)

Topicalization/focus is a different matter. If the word order is not constrained by the syntactic properties corresponding to the needed semantic/relational properties, then word order can be used for topicalization, e.g. assisted by the Japanese topic particle wa: The left side of wa is the topic/subject and the right side is the predicate. The emphasis can be defined to lie on the last parts of both sides:

a John wa o Jane ip = John: he is the FATHER of Jane
a John wa ip o Jane = John: he is the father of JANE
a John ip wa o Jane = the CHILD of John: it is Jane
ip a John wa o Jane = the child of JOHN: it is Jane
a John o Jane wa ip = the relation between JANE and John: Jane is the daughter
ip wa o Jane a John = concerning the parent-child-relation, John is the father of Jane

Why can't "ip John a Jane o", or even "ip John Jane a o", be a valid word order? Are "a" and "o" not words?

Without any rules, all becomes (im)possible. Here a and o are (arbitrarily) defined as prepositions which by definition must precede the corresponding arguments.

Wolfgang:

Nathan Sanders:

Presumably, the order cannot be changed: "ipafipam" (paternal grandmother) is not the same thing as "ipamipaf" (maternal grandfather).

If you describe the way to a destination to somebody, you start with the first passage from the initial location and end up with the last passage to the destination. It is simply more reasonable to do it this way than the other way round. Compare:

§ go to the last room of the left side of the second floor of the third house of that street

§ go to: that street -> third house -> second floor -> left side -> last room

In a similar way, when creating the concept uncle (of a given person), it seems more reasonable to me to start with the given person, to go further to the parents and grandparents and end up with the son of the latter than the other way round.

The first set of morphemes refers to the latest generation, and as you move rightward, morphemes refer to earlier and earlier generations.

Only in the case of parent, "the first set of morphemes refers to the latest generation". In the case of child it is the opposite. For instance ipof-ipom (grandson on daughter's side) means: at first you target a female child (of somebody) and then further a male child (of the female child).

The establishment of general order-principles such as e.g.

1) from general to specific

2) from center (origin) to border (--> e.g.: from present to past)

3) from past to future

seems to me quite important for planlinguistics. In the case we are dealing with, both 2 and 3 could be applied with equal right, 2 however can be defined to have higher priority than 3 (and lower priority than 1).

That is, there is an assumed, inherent order for the object morphemes to precede the agent morphemes: ipamipaf = "the female parent of a male parent", not "the male parent of a female parent". Why is this order required when adding affixes to "ip", but not when building sentences with "ip"? It seems like such a mixed system would actually be more confusing to a learner than easier! (Especially if the learner's native language doesn't make a significant distinction between morpheme and word ...)

If we label the first argument of a relation REL(arg1, arg2) by the preposition a and the second by o, then the relation remains the same, irrespective of whether we write e.g. 'a arg1 REL o arg2' or 'o arg2 a arg1 REL'. In the case of combining relations however, the order in which they are combined may play a decisive role.
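
That the relation stays the same under reordering can be shown with a small parser. The Python sketch below assumes that a and o are prepositions directly preceding their arguments and that any other word is the relation; topic particles such as wa are not handled. The function name `parse` is hypothetical:

```python
def parse(sentence, markers=("a", "o")):
    """Parse e.g. 'a John ip o Jane' into (relation, {marker: argument})."""
    tokens = sentence.split()
    relation, arguments = None, {}
    i = 0
    while i < len(tokens):
        if tokens[i] in markers:
            # A marker is a preposition: the next token is its argument.
            arguments[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            relation = tokens[i]
            i += 1
    return relation, arguments


print(parse("a John ip o Jane"))   # ('ip', {'a': 'John', 'o': 'Jane'})
print(parse("a John ip o Jane") == parse("o Jane a John ip"))   # True
```

Both word orders yield the same structure, because the markers and not the positions identify the arguments.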

So I don't think it is a mixed system in the sense you suggest.

I just think that the idea of designing a "better" language than a typical natural language is inherently flawed, because perceived improvements in one area invariably lead to problems in another area – usually an area the conlanger (like most other conlangers) hasn't even thought about, such as pragmatics, lexical neighborhood density, or the need for redundancy.

Here I totally disagree with you. Look e.g. at the translation of 'order' into German: http://www.dict.cc/?s=order. Don't you consider this rather chaotic? And why shouldn't it be a mere prejudice that "improvements in one area invariably lead to problems in another area"?

And why do we need unavoidable redundancy? It is never a problem to create additional redundancy, if necessary.

Moreover, I'm convinced that schematic concept formation leads to the highest and most efficient 'lexical neighborhood density'.

Occam's razor and efficient concept formation in planlinguistics – 2005-08-07

Nathan Sanders:

If word order is fixed, no one on either side would worry about the pragmatic effects of word order at all.

That's obvious. It was not my intention to deal with pragmatic effects, topicalization, emphasis and similar, neither with the creation of concrete words nor of a concrete language. I've been creating and using such words only as examples.

In my opinion it's not a further planned language we need at the moment, but something like a dictionary of concepts, which is something like mapping the semantic space of words used in concrete languages.

For example: If such a dictionary contains schematically constructed concepts involving both 'to ask somebody to do something' and 'to command', then all concepts from 'gentle solicitation' to 'crude coercion' are part of the dictionary.

The creation of such a semantic dictionary would as a by-product help to clear up many ordinary words (e.g. want, may, 'may not', should, 'must not') or less-ordinary words (e.g. soul, mind, ghost, spirit).

For a speaker coming from a language with little or no internal morphology, this distinction between "sentence parts" and "word parts" will be an extra hurdle already, so why make it worse by giving them contradictory rules? If the word parts have a strict order, then why shouldn't the very same arguments have the same order when they are sentence parts?

If a person is not able or willing to understand the word-parts of words created for concepts such as daughter and grandfather by means of schematic concept formation, then he may simply ignore the internal structure of character strings like ipof or ipa-ipam. Why should it for such a person be more difficult to learn ipa-ipam than to learn grand-father?

In the case of "5 + 7", you can exchange the places of 5 and 7. In the case of 57 however, 5 and 7 cannot be freely ordered, because the meaning of a decimal number depends on the sequence of its digits. By the way, the decimal system and similar number systems are excellent examples showing the efficiency of schematic concept formation.

At least from the semantic point of view, there is no clear-cut distinction between "word parts" and "sentence parts". Rather there are different groupings at different levels, similar to mathematical formulas such as: "((25 - y) - (2x + 5z)) * (3z - 27)".

All such groupings (of word parts and sentence parts) must sometimes have a strict order, as 36 and 63 or "friend of enemy" and "enemy of friend" are different concepts. However, "below 9 and above 3" is the same as "above 3 and below 9". In any language some elements can be freely ordered and others must be fixed. In planlinguistics we should postulate as few rules as possible, but also: as many rules as needed!

Let us deal now with the concept give-receive alias GR. Unlike the static parent-child relation, it implies a change of state. The relation obviously has three arguments: giver, receiver, and exchanged object.

So, one must link three different arguments in a clear way with the relation GR (give-receive). Whether we use prepositions, postpositions, a place structure system as in Lojban, or some other mechanism does not matter.

Because there is an analogy between parent and child of the parent-child relation and giver and receiver of the give-receive relation, the argument giver should use the same argument-marker as parent and receiver the same marker as child. As the third argument-marker (for exchanged object) let us introduce the preposition i.

a-John GR o-Jane i-ice

The derivability of giver, receiver and exchanged object corresponds obviously better to Occam's razor than the introduction of independent semantic units. And it's obvious that we should somehow use the argument-markers (in our case a, o and i) to indicate, which argument is derived from the original relation GR. Here these markers are simply used as affixes to GR:

GRa (o-Jane i-ice)  = John = giving person
GRo (a-John i-ice)  = Jane = receiving person
GRi (a-John o-Jane) = ice  = exchanged object
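
The derivation of a noun from a relation by affixing an argument-marker can be sketched in Python. The representation of one give-receive event as a dictionary and the function name `derive` are assumptions made for this illustration:

```python
def derive(concept, relation, arguments):
    """Resolve a derived noun such as 'GRa' against one instance of `relation`.

    `arguments` maps each argument-marker to the entity filling that slot.
    """
    if not concept.startswith(relation) or len(concept) != len(relation) + 1:
        raise ValueError(f"{concept!r} is not {relation!r} plus one marker")
    marker = concept[len(relation):]
    return arguments[marker]


event = {"a": "John", "o": "Jane", "i": "ice"}   # a-John GR o-Jane i-ice

print(derive("GRa", "GR", event))   # John (giving person)
print(derive("GRo", "GR", event))   # Jane (receiving person)
print(derive("GRi", "GR", event))   # ice (exchanged object)
```

The same function works unchanged for ipa and ipo, which reflects the point that no independent semantic units are needed for the derived nouns.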

If we take Occam's razor seriously then concepts such as sell-buy, lend-borrow and probably also teach-learn must be schematized in the same way, possibly introducing a further mechanism in order to indicate if an argument is especially active or passive. "He teaches her something" normally doesn't mean "She learns something from him".

What improvement do you suppose you can make to a language (without sacrificing expressive power of course) that would not lead to complication in another part of the language?

Do you actually believe that the abolishment of superfluous exceptions would diminish expressive power, or complicate another part of a language?

Redundancy is good for spoken language, because humans make speech errors, have different voices and accents, slur things together, pause and continue somewhere else in the sentence, get drowned out by background noise, etc.

The shorter the words, the more time you have to say the message in an understandable way. The longer the words, the higher the tendency to articulate improperly.

And the longer the words, the higher the probability of overlooking orthographic errors. In short words, errors are normally overlooked only if the words can be skipped because of the context.

If room number 325 means "25th room of the third floor", then the error of having written 326 will easily be recognized, in the case of 34096243598 instead of 34095243598 however, far less easily. The same is valid for 'is' and 'ie' versus 'phonetically' and 'phonatically' or 'phoneticelly'.

And are you sure that in the case of background noise it is easier to understand long words such as e.g. 'phonetically' than simple words such as e.g. 'sea/see'?

And if the words are short, we can in analogy to Andrew Nowicki's Long Ygde create long versions of words by simply spelling out the words using predefined, optimally distinguishable names for the characters.

Wolfgang:

Moreover, I'm convinced that schematic concept formation leads to the highest and most efficient 'lexical neighborhood density'.

I suppose now I misunderstood lexical neighborhood density. I interpreted it as the density of concepts/meanings, but now I suppose you mean something like density of character strings.

Nathan Sanders:

It depends on how phonetically similar your proposed words are. If you plan on using, say, 'ip', 'it', and 'ik' as words, your lexical neighborhood density will be very high (which has negative effects like increasing processing time and confusability).

I don't understand. My experience is: the shorter the words, the easier to recognize them and to find them in a dictionary. And if a person confuses 'ip' with 'it' and 'ik', then he also confuses 'iteration' with iperation, iterakion, itaretion and so on.

Occam's razor and efficient concept formation in planlinguistics – 2005-08-14

"Mapping the semantic space" is far from "cataloguing meanings". It's rather the creation of useful and obvious concepts and of concept-formation principles.

Nathan Sanders:

Native speakers of a language know what words mean when they use them.

Not necessarily. We can learn to correctly use words by pure associative linking, i.e. by practice without understanding. That even modern science is still based far more on associative thinking than on logical reasoning (Kant's "a priori judgements") is also a consequence of the extremely chaotic state of natural languages.

Awareness of one's own thinking seems quite important to me. And because our thinking depends (to different degrees) on the languages we are using, awareness of the semantics of languages is a prerequisite for a better understanding of ourselves. It is however completely impossible to get an overview of the semantic space of natural languages by simply dealing with dictionaries, especially in the case of English.

Nathan Sanders:

Non-native learners aren't necessarily going to be helped by having to refer to an intermediate dictionary of universal meanings, most of which won't even be applicable to either their native language or their target language.

I often have a concept or only a vague idea of a concept in mind, but don't know a corresponding word. So at least I would be helped by a well-structured semantic dictionary which obviously must be linked with ordinary dictionaries of as many languages as possible. The concepts in the semantic dictionary may have neighboring concepts, but all of them are non-ambiguous.

Nathan Sanders:

If the linearly first argument of "ip" has a required semantics within words, then why not make the same rule apply within sentences?

It makes sense to introduce special rules for word formation that are more straightforward than the ones for sentence formation. From

a John o Jane ip (John and Jane: parent-child-relation)
a Jane o Jack ip (Jane and Jack: parent-child-relation)

we get:

Jane = o Jack ipa (parent of Jack)
John = o Jane ipa (parent of Jane)
John = o [o Jack ipa] ipa (parent of [parent of Jack])
John = o Jack ipaipa (grandparent of Jack)
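
The derivation above can be mimicked in a few lines of Python. This is only a minimal sketch, not part of the proposal itself: each statement of the form 'a X o Y ip' is stored as a pair, and 'ipa' becomes a lookup of the a-argument. The names facts, ipa and ipaipa are hypothetical helpers chosen for this illustration, and the sketch assumes (as the two statements above do) that each child is listed with a single parent.

```python
# Each statement "a X o Y ip" is stored as (a-argument, o-argument),
# i.e. (parent, child). Only the two statements from the text are used.
facts = [("John", "Jane"), ("Jane", "Jack")]

def ipa(o_argument):
    """Return the a-argument (parent) recorded for the given o-argument, or None."""
    for a_arg, o_arg in facts:
        if o_arg == o_argument:
            return a_arg
    return None

def ipaipa(o_argument):
    """Apply ipa twice: parent of [parent of o_argument]."""
    parent = ipa(o_argument)
    return ipa(parent) if parent is not None else None

print(ipa("Jack"))     # Jane
print(ipa("Jane"))     # John
print(ipaipa("Jack"))  # John
```

The point of the sketch is only that 'ipaipa' needs no rule of its own: it is 'ipa' applied to the result of 'ipa'.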

It's interesting that prepositions (such as the English of) seem to favor this order

room of [floor of [house of street]]

whereas postpositions (such as the Japanese no) this one:

[[street no house] no floor] no room

If we use of as a postposition and no as a preposition, we get

room [floor [house street of] of] of

no [no [no street house] floor] room
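
The four orders above can be generated mechanically from the same chain of nouns. The following Python sketch is only an illustration under my own assumptions: the function names and the bracketing helper wrap are hypothetical, and the chain is simply the example from the text.

```python
from functools import reduce

# Chain from the outermost container to the innermost item, as in the text.
chain = ["street", "house", "floor", "room"]

def wrap(phrase):
    """Bracket a phrase only if it consists of more than one word."""
    return f"[{phrase}]" if " " in phrase else phrase

def marker_between_head_first(marker, items):
    # head marker [dependent], e.g. English "of"
    return reduce(lambda inner, head: f"{head} {marker} {wrap(inner)}", items)

def marker_between_head_last(marker, items):
    # [dependent] marker head, e.g. Japanese "no"
    return reduce(lambda inner, head: f"{wrap(inner)} {marker} {head}", items)

def marker_final(marker, items):
    # head [dependent] marker: the marker closes each phrase
    return reduce(lambda inner, head: f"{head} {wrap(inner)} {marker}", items)

def marker_initial(marker, items):
    # marker [dependent] head: the marker opens each phrase
    return reduce(lambda inner, head: f"{marker} {wrap(inner)} {head}", items)

print(marker_between_head_first("of", chain))  # room of [floor of [house of street]]
print(marker_between_head_last("no", chain))   # [[street no house] no floor] no room
print(marker_final("of", chain))               # room [floor [house street of] of] of
print(marker_initial("no", chain))             # no [no [no street house] floor] room
```

All four functions fold over the same list; they differ only in where the marker and the head are placed relative to the already built phrase.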

Wolfgang:

At least from the semantic point of view, there is no clear-cut distinction between "word parts" and "sentence parts".

Nathan Sanders:

Yes, precisely (indeed, some languages barely make any distinction at all). I don't see how drawing an arbitrary line for that distinction, and then giving each side a different set of rules, is supposed to be "better" than, say, not having the distinction at all.

Why does it make sense to abbreviate

(3 * X * X * X + 2 * X + 5) * ((5 * X + 7) * a + 5)

where the Roman numeral X means ten, to

3025 * (57a + 5)

Why are we "drawing an arbitrary line" between word parts such as 3025 and sentence parts such as '3 * X * X * X + 2 * X + 5', and "giving each side a different set of rules"? I think the answer is obvious.
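
To make the comparison concrete, here is a small Python sketch (my own illustration; the helper digits_to_value is a hypothetical name). It checks that the compact "word" 3025 and the spelled-out "sentence" denote the same value, and shows that the word-internal rule is nothing more than an implicit "multiply by ten and add".

```python
X = 10  # the Roman numeral X, as in the text

def digits_to_value(digits, base=X):
    """Evaluate a digit string by the word-internal rule:
    multiply the running value by the base, then add the next digit."""
    value = 0
    for d in digits:
        value = value * base + int(d)
    return value

long_form = 3 * X * X * X + 2 * X + 5
print(long_form, digits_to_value("3025"))  # 3025 3025
print(5 * X + 7, digits_to_value("57"))    # 57 57

# The two notations agree for any value of a; a = 2 is an arbitrary choice.
a = 2
assert long_form * ((5 * X + 7) * a + 5) == 3025 * (57 * a + 5)
```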

(Nothing is "better" for ordinary human communication than existing natural human languages.)

I know from personal experience that Esperanto is much "better" for human communication than English. From the semantic point of view, English could be the worst of all languages. Languages are not equally good for human communication. E.g. Turkish is far easier to master than Russian, and Chinese (apart from pronunciation) far easier than Japanese. That Russian is easier than Turkish for a Ukrainian is a different matter.

Name a superfluous exception that exists naturally in anyone's native idiolect, so I can see what you're talking about.

It seems to me that you haven't learned foreign languages for a long time. Otherwise you would easily understand what I'm talking about. In English you use lots of rules and exceptions you are not aware of. What kind of "complication in another part of the language" or of "sacrificing expressive power" would result if <read, readed, readed> instead of <read, read, read> were correct, or if English orthography were less irrational?

Of course: short words are generally easier on the speaker than long words are.

Short words are also easier to learn. And at least when reading in a foreign script such as Cyrillic or mirror writing, short words are easier to recognize than long words (provided that each character is recognizable).

Wolfgang:

If room number 325 means the 25th room of the third floor, then the error of having written 326 will easily be recognized; in the case of 34096243598 instead of 34095243598 however, far less easily.

Nathan Sanders:

This is only true because all of the numbers you have given are actual numbers that could all be equally expected to be encountered.

Your objection does not correspond to what I've written. I claim that in meaningful numbers, the longer the number, the higher the probability per digit of overlooking an error. You are dealing with redundancy and presuppose that a reader already knows that 34096243598 does not exist and that the nearest existing neighbor is 34095243598.

In natural human languages, long lexical neighbors are less likely to exist than short lexical neighbors (see Frauenfelder, Baayen, and Hellwig 1993 for crosslinguistic evidence that shorter morphemes are found in denser lexical neighborhoods than longer morphemes are).

The concept "lexical neighbor" seems problematic to me. I assume that with "long lexical neighbors" you mean 'lexical neighbors of long words'.

Even if we assume that the error probability per character does not increase with the length of the words, we must not ignore that the longer a word, the higher the probability of one or more errors in the word. If we assume for instance an error probability of 10% per digit, then we get around 10 errors both in 50 two-character words and in 10 ten-character words.
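
The arithmetic can be checked with a short Python sketch. It rests on one assumption that I have not stated above, namely that errors occur independently for each character; the function names are hypothetical and serve only this illustration.

```python
P_CHAR = 0.10  # error probability per character, as in the example above

def expected_errors(word_count, word_length, p_char=P_CHAR):
    """Expected number of character errors in the whole text."""
    return word_count * word_length * p_char

def p_word_damaged(word_length, p_char=P_CHAR):
    """Probability that a single word contains at least one error,
    ASSUMING independent errors per character."""
    return 1 - (1 - p_char) ** word_length

# Same expected number of errors in both texts (100 characters each).
print(expected_errors(50, 2), expected_errors(10, 10))

# But a single long word is far more likely to be damaged than a short one.
print(round(p_word_damaged(2), 3), round(p_word_damaged(10), 3))
```

Under this assumption roughly 19% of the two-character words contain an error, compared with roughly 65% of the ten-character words, although both texts contain about 10 errors in total.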

But I've pretty much got to hear 100% of "is" to be certain it isn't one of dozens of possibilities.

In the case of the meanings of 'order', 'apply', 'set', 'get' and similar words, you defend the confusion of dozens of possibilities. So why do you consider ambiguity a weakness if it results from unclear perception, but not if it is already inherent in the language?

The word 'is' is normally not very important for understanding. In the case of 'it is ice', 'it is' is rather perceived as a whole. So understanding 'ik is ice', 'it es ice' or 'it it ice' seems no harder than understanding a normal seven-character word with one error.

And if we have to create additional redundancy, e.g. for 'ice' or 'eyes', then we can also use a method that is quite frequent in a language with lots of homonyms, such as Japanese. Instead of 'it is ice' or 'it is eyes' we say 'it is foodstuff ice' or 'it is body-part eyes'.

If irreducible redundancy is as important as you claim, then why doesn't our number system (in written form) have such redundancy? The answer is simple: It is easy to add redundancy if necessary, e.g. by repeating a number, or by adding the number in word-form: '4483 - four four eight three'.

Shorter words are more likely to be very similar to other existing shorter words, while longer words are less likely to be similar to existing longer words.

One-character words differ by 100% from their one-character neighbors. 'He' and 'we' differ by 50%, 'chance' and 'change' by around 17%.

d id kid kids
t it kit kits
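
These percentages are simply the share of character positions at which two equally long words differ. A minimal Python sketch (the function name differing_share is hypothetical, and comparing written characters rather than phonemes is a simplification):

```python
def differing_share(word_a, word_b):
    """Share of positions at which two equally long words differ."""
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same length")
    differences = sum(1 for x, y in zip(word_a, word_b) if x != y)
    return differences / len(word_a)

pairs = [("d", "t"), ("id", "it"), ("kid", "kit"), ("kids", "kits"),
         ("he", "we"), ("chance", "change")]
for a, b in pairs:
    print(f"{a}/{b}: {differing_share(a, b):.0%}")
```

For the pairs d/t, id/it, kid/kit and kids/kits the share falls from 100% to 50%, 33% and 25%: the same single substitution weighs less the longer the word is.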

If you change just a single phoneme in "is", you get numerous lexical neighbors: ease, as, awes, Oz, ahs, owes, ooze, oohs, eyes, eye's, ayes, it, ick, Id, if, itch, in, inn, ill, and plenty of others. That's an enormously large lexical neighborhood density.

If we replace 50% of the characters or phonemes in longer words, then the outcome is quite similar.

Because "is" has more lexical neighbors, it takes longer for a listener to process it, to ensure that it isn't one of the phonetically similar alternatives. (This is psycholinguistic reality; see research on lexical neighborhood density, such as Goldinger, Luce, and Pisoni 1989, and Cluff and Luce 1990.)

I'm rather skeptical. Maybe the research did not sufficiently take into account that the processing of words strongly depends on context and that we do not expect to hear 'is' or 'as' as isolated words.

Let us take this example: 'he is as at ease as she is'. I agree that when hearing the words of this sentence separately, they are more difficult to recognize than long words such as 'university'. Maybe it would even be difficult to notice that they are English words. Nevertheless, when listening to the sentence as a whole, the words become easily understandable, and the number of lexical neighbors of the single words becomes rather irrelevant.

Occam's razor and efficient concept formation in planlinguistics – 2005-08-22

[Continuation of give-receive alias GR]

Natural languages have a large number of words and constructions for the exchange of objects between persons:

to give sb sth
to be given
giver, donor
to present, to donate
to get, to receive
to take sth from sb
to steal
thief
stolen goods
to saddle sb with sth
to be saddled with sth
to take sth off sb's shoulders
…

Whereas "Jane receives the dog from John" is semantically almost the same as "John gives the dog to Jane", it is not the same as "Jane takes the dog from John". In the latter, Jane and not John is the active person. So we must introduce a way to indicate whether an argument is especially active or passive. One possibility is to link the argument markers a, o and i with two additional markers of activity and passivity.

Let us choose c for active and p for passive. Each of the three arguments of the relation 'give-receive' alias GR can then be absent, or neutrally, actively or passively present, so we get 4 * 4 * 4 = 64 more or less useful combinations:

GR
GR a John
GR ac John (John gives)
GR ap John (sth is taken from John)

GR o Jane
GR a John o Jane
GR ac John o Jane (John gives to Jane)
GR ap John o Jane

GR oc Jane
GR a John oc Jane (Jane takes from John)
GR ac John oc Jane
GR ap John oc Jane

GR op Jane
GR a John op Jane (Jane receives from John)
GR ac John op Jane
GR ap John op Jane

…

GR ap John op Jane i dog

…

GR ap John op Jane ic dog

…

GR op Jane ip dog
GR a John op Jane ip dog
GR ac John op Jane ip dog
GR ap John op Jane ip dog

'To present (to donate)' and similar words are essentially only 'to give' with positive connotation, whereas 'to steal' and similar words are 'to take from' with negative connotation. So it makes sense to introduce two semantic units for positive and negative connotation (POS and NEG) in order to get 64 combinations each of POS-GR and NEG-GR, in the same way as above with GR. So we reach 3 * 64 = 192 combinations.
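
The counts can be verified by brute-force enumeration. The following Python sketch is only an illustration: it uses the fillers John, Jane and dog from the listing above, and it assumes that POS-GR and NEG-GR phrases are written exactly like GR phrases with the prefix attached, which I have not spelled out anywhere. The function name phrase is hypothetical.

```python
from itertools import product

# Each argument is absent (None), neutral (""), active ("c") or passive ("p").
STATES = [None, "", "c", "p"]
# Argument markers and example fillers taken from the listing above.
ARGUMENTS = [("a", "John"), ("o", "Jane"), ("i", "dog")]
# Plain GR plus the two connotation variants (assumed notation).
RELATIONS = ["GR", "POS-GR", "NEG-GR"]

def phrase(relation, states):
    """Build one phrase such as 'GR ac John o Jane'."""
    parts = [relation]
    for (marker, filler), state in zip(ARGUMENTS, states):
        if state is not None:
            parts.append(f"{marker}{state} {filler}")
    return " ".join(parts)

plain = [phrase("GR", s) for s in product(STATES, repeat=3)]
everything = [phrase(r, s) for r in RELATIONS
              for s in product(STATES, repeat=3)]

print(len(plain), len(everything))  # 64 192
print(plain[0])                     # GR
```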

As shown earlier, it is possible to derive concepts corresponding to the arguments of the GR concept by simply using the argument markers as affixes to GR:

GRa:  person who gives or from whom sth is taken
GRac: giver
GRap: person from whom sth is taken

GRo:  person who receives or takes sth
GRoc: person who takes sth
GRop: receiver

GRi:  exchanged object
GRic: exchanged object actively participating in exchange
GRip: exchanged object passively participating in exchange

These nine concepts (GRa, GRac, ..., GRip) derived from the three-argument concept GR are two-argument concepts. Because each of the two remaining arguments can be absent, or neutrally, actively or passively present, we get 4 * 4 = 16 combinations for each of the nine argument concepts, i.e. 9 * 4 * 4 = 144 combinations in total. Because each of the nine concepts can also be combined with POS and NEG, we have 3 * 9 = 27 argument concepts (from 'GRa' to 'NEG-GRip') and consequently 3 * 9 * 4 * 4 = 432 combinations.
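
Again the numbers can be checked by enumeration. This is a minimal Python sketch; the variable names are hypothetical, and the written form of the prefixed concepts (e.g. 'NEG-GRip') follows the notation used in the paragraph above.

```python
from itertools import product

ARGUMENTS = ["a", "o", "i"]
VARIANTS = ["", "c", "p"]        # neutral, active, passive
STATES = [None, "", "c", "p"]    # absent, neutral, active, passive
PREFIXES = ["", "POS-", "NEG-"]  # plain, positive, negative connotation

# The nine argument concepts: GRa, GRac, GRap, GRo, ..., GRip.
concepts = [f"GR{arg}{var}" for arg in ARGUMENTS for var in VARIANTS]

# Each concept leaves two arguments, each with four possible states.
plain = [(c, states) for c in concepts
         for states in product(STATES, repeat=2)]

# The same again with the plain form and the two connotation prefixes.
prefixed = [(p + c, states) for p in PREFIXES for c, states in plain]

print(len(concepts), len(plain), len(prefixed))  # 9 144 432
```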

Despite a total of 192 + 432 = 624 combinations, the concept 'stolen goods' is not yet among them, because

NEG-GRi: negative-connotation -> exchanged object

does not discriminate between the loss of useful/positive goods (e.g. an article of value or a cheque) and receiving something detrimental/negative (e.g. hazardous waste or a fabricated certificate of debt). Nevertheless there are two obvious ways to create the concept of 'stolen goods':

1. NEG-GR-ap-i

= negative connotation -> give-receive -> a-argument-passive -> i-argument
= negative connotation -> person(s) from whom object(s) is/are taken -> exchanged object(s)
= stolen from -> stolen object(s)

2. NEG-GR-oc-i

= negative connotation -> give-receive -> o-argument-active -> i-argument
= negative connotation -> person(s) taking object(s) -> exchanged object(s)
= stealing person(s) -> stolen object(s)

Analogously, 'present (gift)' must be constructed as POS-GR-ac-i or POS-GR-op-i. The compound concepts NEG-GR-ac-i and NEG-GR-op-i, as well as POS-GR-ap-i and POS-GR-oc-i, should also be clear. Maybe someone can provide sound English translations.
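
The first line of each gloss above can be produced mechanically, morpheme by morpheme. The following Python sketch is only an illustration with hypothetical names (GLOSSES, ACTIVITY, gloss); it yields the literal gloss, not an English translation.

```python
GLOSSES = {
    "POS": "positive connotation",
    "NEG": "negative connotation",
    "GR": "give-receive",
}
ACTIVITY = {"c": "active", "p": "passive"}

def gloss(compound):
    """Gloss a compound such as 'NEG-GR-ap-i' morpheme by morpheme."""
    parts = []
    for morpheme in compound.split("-"):
        if morpheme in GLOSSES:
            parts.append(GLOSSES[morpheme])
        elif morpheme and morpheme[0] in "aoi" and morpheme[1:] in ("", "c", "p"):
            # Argument marker a, o or i, optionally followed by c or p.
            label = f"{morpheme[0]}-argument"
            if morpheme[1:]:
                label += "-" + ACTIVITY[morpheme[1:]]
            parts.append(label)
        else:
            raise ValueError(f"unknown morpheme: {morpheme!r}")
    return " -> ".join(parts)

for compound in ["NEG-GR-ap-i", "NEG-GR-oc-i", "POS-GR-ac-i", "POS-GR-op-i"]:
    print(compound, "=", gloss(compound))
```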