Pages in topic:   < [1 2 3] >
(Part of) the IATE database can now be downloaded as a massive TBX!
Thread poster: Michael Beijer
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 16:54
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Cool video (Paul again) showing the IATE termbase after importing it into MultiTerm: Jul 13, 2014

https://www.youtube.com/watch?v=xDv-y0p0NXs&feature=youtu.be

 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 17:54
Member (2006)
English to Afrikaans
+ ...
What Paul doesn't seem to do Jul 14, 2014

Michael Beijer wrote:
Interesting post on Paul Filkin's blog: http://multifarious.filkin.com/2014/07/13/what-a-whopper/


He mentions something that he doesn't seem to do, namely the idea to remove the languages that you don't want, before importing the file, so as to decrease the file size.


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 16:54
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
Interesting developments. Jul 14, 2014

Email to Xbench support:

"Hello,

I have been trying to import the recently downloadable IATE database, which can now be downloaded as a TBX from the IATE site, and noticed that Xbench isn’t importing the file properly. It is missing a lot of metadata and not at all seeing the way the synonyms are related. Many people are currently discussing this issue, e.g., here:

http://multifarious.filkin.com/2014/07/13/what-a-whopper/
http://www.proz.com/forum/translator_resources/271879-part_of_the_iate_database_can_now_be_downloaded_as_a_massive_tbx.html
https://groups.google.com/forum/?fromgroups=#!topic/cafetranslators/WEoiqacrpo0
https://www.youtube.com/watch?v=xDv-y0p0NXs&feature=youtu.be

I think it would be great if Xbench was changed so that it would be able to correctly import this file, and files of its kind. This would be a ‘unique selling point’ for Xbench, as there is currently no other (single) program that can do this.

Michael"
They just answered:

"Hi Michael,

Thank you for your email. We published a new build of Xbench 3.0, which handles synonyms correctly. The 64-bit version is required to load the huge IATE .tbx file.

Download and install Xbench 3.0 build 1243 (64 bits). It is available at http://www.xbench.net/index.php/download.

It takes some time for Xbench to scan the file and list the languages available in the IATE .tbx file in the "Select Languages" window.

I have attached a screenshot of one term (EN > ES), with synonyms and metadata. If some metadata is missing, please send us an example so that we can reproduce this issue.

Regards,

Oscar Martin,
The Xbench Team"
Haven’t tried it yet as I am away from the office, but this looks promising!
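
In the meantime, the language scan Oscar describes can be approximated outside Xbench. What follows is only a rough sketch in Python, not how Xbench does it: it assumes every language section opens with a tag of the form <langSet xml:lang="xx"> and that the file contains line breaks, so it can be read one line at a time. The function name count_languages is made up for this example.

```python
import re
from collections import Counter

# Assumed shape of the opening tag: <langSet xml:lang="en">
LANG = re.compile(r'<langSet xml:lang="([^"]+)"')

def count_languages(path):
    """Count how many <langSet> sections each language code has."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # one line at a time, so the whole file never sits in RAM
            counts.update(LANG.findall(line))
    return counts
```

Calling count_languages on the downloaded .tbx returns a Counter, and its most_common() method lists the languages by number of sections.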

[Edited at 2014-07-14 18:18 GMT]


 
Giovanni Guarnieri MITI, MIL
Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 16:54
Member (2004)
English to Italian
that's what I used... Jul 14, 2014

Michael Beijer wrote:

We published a new build of Xbench 3.0, which handles synonyms correctly. The 64-bit version is required to load the huge IATE .tbx file.


As I said, when I imported the TMX file into a Studio memory, only half entries were imported... the rest discarded as "errors"... unfortunately, I couldn't see what these errors were about...


 
FarkasAndras
FarkasAndras  Identity Verified
Local time: 17:54
English to Hungarian
+ ...
fast Jul 14, 2014

Michael Beijer wrote:


Thank you for your email. We published a new build of Xbench 3.0, which handles synonyms correctly. The 64-bit version is required to load the huge IATE .tbx file.


That's a pretty impressive turnaround time from the xbench team. I have only messed around a little bit with it myself, only enough to see that it will be easy to process.


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 16:54
Member (2009)
Dutch to English
+ ...
TOPIC STARTER
TMX format causing the loss of entries? Jul 14, 2014

Giovanni Guarnieri MITI, MIL wrote:

Michael Beijer wrote:

We published a new build of Xbench 3.0, which handles synonyms correctly. The 64-bit version is required to load the huge IATE .tbx file.


As I said, when I imported the TMX file into a Studio memory, only half entries were imported... the rest discarded as "errors"... unfortunately, I couldn't see what these errors were about...


Hi Giovanni,

I'm not sure as I haven't tested the latest build of Xbench yet. However, I have a feeling that what happened is you lost the entries because of the TMX format, which wasn't designed to handle synonyms. What happens if you export to tabbed text?

Michael
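
To make the tabbed-text idea concrete: a TMX unit pairs one source segment with one target segment, so an entry with several synonyms per language has to be flattened somehow. A minimal Python sketch of one way to do that, writing one row per source/target combination. It assumes the usual TBX layout (termEntry containing langSet elements with an xml:lang attribute, with term elements somewhere below); the function names are invented for this example, and it has not been run against the real IATE file.

```python
import itertools
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def entry_rows(entry, src, tgt):
    """Return every (source term, target term) pair in one termEntry."""
    terms = {}
    for lang_set in entry.iter("langSet"):
        lang = lang_set.get(XML_LANG)
        found = [t.text.strip() for t in lang_set.iter("term") if t.text]
        terms.setdefault(lang, []).extend(found)
    # Synonyms multiply: 2 source terms x 3 target terms gives 6 rows.
    return list(itertools.product(terms.get(src, []), terms.get(tgt, [])))

def tbx_to_tsv(tbx_path, tsv_path, src, tgt):
    """Stream the TBX and write one tab-separated row per term pair."""
    with open(tsv_path, "w", encoding="utf-8") as out:
        for _, elem in ET.iterparse(tbx_path, events=("end",)):
            if elem.tag == "termEntry":
                for s, t in entry_rows(elem, src, tgt):
                    out.write(f"{s}\t{t}\n")
                elem.clear()  # drop the entry's contents to keep memory down
```

For example, tbx_to_tsv("iate.tbx", "nl-en.txt", "nl", "en") would write the Dutch-English pairs (the file names here are placeholders). The cross product keeps every synonym but loses the information about which terms belong to the same concept, which is the trade-off of any flat format.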


 
RWS Community
RWS Community
United Kingdom
Local time: 17:54
English
Yes he does ;-) Jul 14, 2014

Samuel Murray wrote:

He mentions something that he doesn't seem to do, namely the idea to remove the languages that you don't want, before importing the file, so as to decrease the file size.



Hi Samuel,

The process in the article goes through extracting only FIGS +English. Not sure how you missed that?

Regards

Paul


 
Meta Arkadia
Meta Arkadia
Local time: 22:54
English to Indonesian
+ ...
Very useful Jul 15, 2014

I converted Michael's (thank you, Michael) nl-en txt file - see above, July 11 - to a TMX file, and ended up with 374,724 entries out of the 401,625 entries Michael mentioned.



Loading the TMX file in CafeTran (Mac, 12GB RAM) took seconds.
Pretty good. I did a quick check, and only found one entry in another language. Wonderful!

Cheers,

Hans

[Edited at 2014-07-15 01:40 GMT]


 
Giovanni Guarnieri MITI, MIL
Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 16:54
Member (2004)
English to Italian
Don't know... Jul 15, 2014


Michael Beijer wrote:

Hi Giovanni,

I'm not sure as I haven't tested the latest build of Xbench yet. However, I have a feeling that what happened is you lost the entries because of the TMX format, which wasn't designed to handle synonyms. What happens if you export to tabbed text?

Michael


I'll have to try that... I'll report back...

EDIT: but then I won't be able to import it as Studio memory? Not an expert on TMs and Studio, as you can tell...

[Edited at 2014-07-15 12:54 GMT]


 
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:54
Member (2006)
Dutch to German
+ ...
Why tmx? Jul 15, 2014

Giovanni Guarnieri MITI, MIL wrote:


Hi Giovanni,

I'm not sure as I haven't tested the latest build of Xbench yet. However, I have a feeling that what happened is you lost the entries because of the TMX format, which wasn't designed to handle synonyms. What happens if you export to tabbed text?

Michael


I'll have to try that... I'll report back...

EDIT: but then I won't be able to import it as Studio memory? Not an expert on TMs and Studio, as you can tell...

[Edited at 2014-07-15 12:54 GMT]


Just as an aside: I keep reading that people try to create a tmx file. Why would you want that? AFAIK, IATE is a termbase...

Hopefully, somebody will come up with an easy way to use the available tbx with MultiTerm...


 
Giovanni Guarnieri MITI, MIL
Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 16:54
Member (2004)
English to Italian
explanation Jul 15, 2014

Erik Freitag wrote:

Giovanni Guarnieri MITI, MIL wrote:


Hi Giovanni,

I'm not sure as I haven't tested the latest build of Xbench yet. However, I have a feeling that what happened is you lost the entries because of the TMX format, which wasn't designed to handle synonyms. What happens if you export to tabbed text?

Michael


I'll have to try that... I'll report back...

EDIT: but then I won't be able to import it as Studio memory? Not an expert on TMs and Studio, as you can tell...

[Edited at 2014-07-15 12:54 GMT]


Just as an aside: I keep reading that people try to create a tmx file. Why would you want that? AFAIK, IATE is a termbase...

Hopefully, somebody will come up with an easy way to use the available tbx with MultiTerm...


Mine was an experiment, to see if I could import the big TBX... I'd like to have a useful format for MultiTerm too...

BTW, I converted the TBX to a tab-delimited format, but I can't import it... it says "invalid file format - no valid signature found"...

EDIT: managed to export the TBX in Excel format... I'm now converting it using MultiTerm Convert... looks like it's going to take a few hours... I'll let you know the result...

[Edited at 2014-07-15 15:04 GMT]


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 17:54
Member (2006)
English to Afrikaans
+ ...
Regex removal of languages in Edit Pad Pro Jul 15, 2014

SDL Support wrote:
Samuel Murray wrote:
He mentions something that he doesn't seem to do, namely the idea to remove the languages that you don't want, before importing the file, so as to decrease the file size.

The process in the article goes through extracting only FIGS +English. Not sure how you missed that?


We may be talking about different things, but what I refer to is the fact that one can use regex in e.g. Edit Pad Pro to remove languages that you don't want. That would reduce the 2.2 GB file to something smaller, *before* feeding it to Glossary Converter or whatever other program. Yet halfway through his blog post he mentions having to split the 2.2 GB file using XMLSplit. In other words, he had not removed any languages from the file before using XMLSplit.

And... if he had used regex in Edit Pad Pro, would he not have mentioned what the regex string is?

To remove languages from the TBX file, using Edit Pad Pro:

Find what: <langSet xml:lang="(LANGUAGES)">.+?</langSet>
Replace with: [nothing]
Regex: YES (enables regex)
Dot: YES (includes newlines in dot definition)

Replace "LANGUAGES" with ro|bg|cs|da|de|el|en|es|et|fi|fr|ga|hu|it|lt|lv|nl|pl|pt|sk|sl|sv|mt|la|hr|mul (remove from this list the languages that you want to keep)

Example: To remove Czech and Hungarian, use this:
<langSet xml:lang="(cs|hu)">.+?</langSet>

Unfortunately I only have 6 GB of RAM, so it takes me about 2 hours to remove all languages except FIGS+E. The FIGS+E file is 1.31 GB (50 MB zipped with 7z). Could anyone tell me if it is a useful file?
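
For anyone without the RAM to run a replace over the whole 2.2 GB file in an editor, the same filtering can be sketched as a small Python script that works through the file in chunks. This is only a sketch under assumptions: it expects each language section to look like <langSet xml:lang="xx">...</langSet> with no other attributes, and each entry to end with </termEntry>. The names strip_languages and filter_tbx are made up for this example, and it has not been tested on the real IATE file.

```python
import re

# One whole language section, plus the whitespace that follows it.
LANGSET = re.compile(r'<langSet xml:lang="([^"]+)">.*?</langSet>\s*', re.DOTALL)
END = "</termEntry>"

def strip_languages(text, keep):
    """Drop every <langSet> section whose language code is not in `keep`."""
    return LANGSET.sub(lambda m: m.group(0) if m.group(1) in keep else "", text)

def filter_tbx(src_path, dst_path, keep, chunk_size=1 << 20):
    """Copy src to dst, keeping only the languages in `keep`."""
    buffer = ""
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            # Only process up to the last complete entry, so that no
            # <langSet> section is ever cut in half between two chunks.
            cut = buffer.rfind(END)
            if cut == -1:
                continue
            cut += len(END)
            dst.write(strip_languages(buffer[:cut], keep))
            buffer = buffer[cut:]
        dst.write(strip_languages(buffer, keep))  # whatever follows the last entry
```

Something like filter_tbx("iate.tbx", "iate_figs.tbx", {"fr", "it", "de", "es", "en"}) would keep FIGS+E (file names are placeholders). Note that this version lists the languages to keep rather than the ones to remove, and everything outside the langSet sections is copied through unchanged.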






[Edited at 2014-07-15 16:07 GMT]


 
Giovanni Guarnieri MITI, MIL
Giovanni Guarnieri MITI, MIL  Identity Verified
United Kingdom
Local time: 16:54
Member (2004)
English to Italian
ok... Jul 15, 2014

conversion finished... over 886,000 terms converted in XML format... now the fun bit... importing it into MultiTerm...

 
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:54
Member (2006)
Dutch to German
+ ...
Edit Pad Pro Jul 15, 2014

Following Samuel's suggestion (thanks for that!), I've just tried twice to remove all languages except NL, EN and DE in Edit Pad Pro. Both times, the process froze after a couple of minutes (after some 4 million matches, but not always after the same number of matches).


[Edited at 2014-07-15 17:32 GMT]


 
Erik Freitag
Erik Freitag  Identity Verified
Germany
Local time: 17:54
Member (2006)
Dutch to German
+ ...
No luck with Samuel's FIGS tbx Jul 15, 2014

Samuel Murray wrote:

Unfortunately I only have 6 GB of RAM, so it takes me about 2 hours to remove all languages except FIGS+E. The FIGS+E file is 1.31 GB (50 MB zipped with 7z). Could anyone tell me if it is a useful file?




[Edited at 2014-07-15 16:07 GMT]


Dear Samuel,

I've downloaded your file and tried to import it with MultiTerm Convert, but no luck. Error message: '<' is an unexpected token. The expected token is '='. Line 286973, position 9. (God how I love error messages that I can't copy and paste!)
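
Since the message can't be copied, one way to see what is actually sitting at that position is to pull the reported line out of the file with a few lines of Python. A minimal sketch, assuming the parser counts lines the same way Python does (by line breaks); show_line is a made-up helper name, and long lines are cut to `width` characters so the output stays readable.

```python
from itertools import islice

def show_line(path, line_no, context=2, width=200):
    """Return (number, text) pairs for the lines around `line_no`."""
    start = max(line_no - context, 1)
    with open(path, encoding="utf-8", errors="replace") as f:
        # islice skips ahead without loading the earlier lines into memory.
        window = islice(f, start - 1, line_no + context)
        return [(n, line.rstrip("\n")[:width]) for n, line in enumerate(window, start)]
```

With the numbers from the error above, show_line("figs.tbx", 286973) would return lines 286971 to 286975 (the file name is a placeholder).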



[Edited at 2014-07-15 17:42 GMT]

[Edited at 2014-07-15 17:43 GMT]


 